Preface


The  purpose  of  this  research  was  to  develop  a  real-time, 
continuous  speech  recognition  system.  Since  many  computer 
algorithms  existed  in  the  AFIT  Signal  Processing  Laboratory 
to  characterize  various  aspects  of  human  speech,  there  was  a 
need  to  combine  these  programs  into  a  viable  system. 

The  system  described  in  this  paper  provides  an 
interactive  means  of  continuous  speech  recognition  by 
computer.  The  system  is  speaker-dependent  and  requires 
training with a 70-word vocabulary prior to word
recognition.

A speech recognition system was designed and implemented to
recognize continuous speech in a real-time environment
(after training). Several techniques were incorporated to
characterize phonemes as vectors in space. Through the use
of distance rules it was possible to characterize words by
a phoneme representation, which could subsequently be used
in word recognition. This approach to speech recognition
offers several possibilities for future investigations,
such as varying the Minkowski distances and applying
clustering techniques. The algorithm developed was
modularized on a hierarchical basis and was user

IMPLEMENTATION OF A REAL-TIME, INTERACTIVE,
CONTINUOUS SPEECH RECOGNITION SYSTEM

I. 

Speech recognition by computer has seen dramatic gains
in the past five years. Primarily as a result of increased
computer processing power, efficient recognition algorithms
can now perform in near real time. However, recognition systems
using  these  algorithms  are  constrained  to  isolated  word 
recognition  and  have  limited  vocabularies  of  between  twenty 
and  one  hundred  words.  The  major  emphasis  of  present  speech 
research  is  the  development  of  an  algorithm  capable  of 
recognizing  natural,  continuous  speech  (6:69). 

The  goal  of  speech  research  is  to  determine  a  decoding 
scheme  similar  to  that  of  the  human  brain.  Many  continuous 
speech  recognition  algorithms  exist,  yet  none  has  a 
significant  word  recognition  rate.  There  are  three  primary 
reasons  for  this,  all  of  which  are  a  consequence  of  our  lack 
of understanding of how the human brain deals with the
complex operation of speech recognition. First, the speech
signal is highly encoded in the brain, and one must acquire
an understanding of how such characteristics as intonation,
articulation, semantics, and syntax are encoded before one

IDATA2 = (sample length) * 800                          (2)

where sample length is the duration in seconds of each audio
input. The variable IDATA3 specifies the locations where
conversion values are placed. In FORTRAN, IDATA3 is an
integer array placed in a labeled common block. The block
label is the same as that given in the SAMGEN configuration
files.
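
As a rough illustration of Equation 2, the following Python sketch computes the conversion count for a recording of a given length. Only the 800 Hz clock rate and the sixteen channels come from the text; the function names and the derived vector count are illustrative assumptions, not part of the original FORTRAN routines.

```python
# Sketch of Equation 2 (assumed helper names, not the thesis code).
CLOCK_HZ = 800   # external clock rate stated in the text
CHANNELS = 16    # filter channels sampled for each vector


def conversion_count(sample_length_s):
    """Total number of A/D conversions (the value placed in IDATA2)."""
    return int(round(sample_length_s * CLOCK_HZ))


def vector_count(sample_length_s):
    """Number of complete sixteen-channel vectors in the recording."""
    return conversion_count(sample_length_s) // CHANNELS
```

For a ten-second recording this gives 8000 conversions, or 500 sixteen-element vectors, which agrees with the 500 vectors quoted for the ten-second training utterance.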


Octal Value     Function

00000K          pulse clock
20000K          DCH clock
40000K          internal clock
60000K          external clock
0-1700K         start channel 0-15
0-17K           final channel 0-15

Table 3.  Octal Values for Bit Setting of IDATA1 (1:69)

Configuration files, produced by an interactive dialog
with the program SAMGEN, define operating system hardware
and operation modes. For example, SAMCONFIG3 has the source
file shown in Figure 3.

The twelve bits in the machine word produce 2^12, or 4096,
different conversion values. Over a range of 10 volts, this
means that each value increment represents approximately
0.0024 volts. Using a one-to-one correspondence between
sampled values and integer values, the real value for
voltage can be determined by the equation

Real Value = Float(Integer Value)/32768.*5.             (1)
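
Equation 1 can be sketched in Python as follows. The constant and function names are assumptions chosen for illustration; the divisor 32768 and the 5 V scale factor are taken directly from the equation.

```python
# Sketch of Equation 1 (illustrative names only).
FULL_SCALE_COUNTS = 32768.0  # divisor used in Equation 1
FULL_SCALE_VOLTS = 5.0       # scale factor used in Equation 1


def counts_to_volts(integer_value):
    """Convert a stored integer conversion value to a voltage."""
    return float(integer_value) / FULL_SCALE_COUNTS * FULL_SCALE_VOLTS
```

If the 12-bit value occupies the upper bits of the machine word, as the bit assignments suggest (an assumption here), one converter step is sixteen counts, and `counts_to_volts(16)` gives roughly 0.0024 V.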

Variables (IDATAx words) are passed to the software
routines to represent channel use numbers, conversion count,
clock source, and storage locations for conversion values.
IDATA1, occupying one machine word, represents clock source
and channel use numbers. There are four clock sources
available to the device: pulse, DCH, internal, and external.
Though all four can be used for A/D conversion, only the
external clock is accessed. Channel numbers are given as
starting and ending channels. If both are the same, only one
channel is specified. The bit values of IDATA1 can be set
using the octal values in Table 3. Since an external
clock and all sixteen channels are used, IDATA1 equals
61700K.

IDATA2, also occupying one machine word, specifies the
total conversion count, or the number of data samples. Thus,
using an 800 Hz external clock, IDATA2 is computed by


A/D  Conversion 


A/D conversion is performed by making use of the model
4331 Eclipse analog device and independent software
interfaces. Though both A/D and D/A capabilities are
present, only the former is addressed.

The Eclipse A/D conversion device operates using device
op-codes and conversion data buffers. Organized around a
single 12-bit converter and two multiplexers, the device can
accept up to 16 different input signals with voltage levels
in a ±5 V range. The conversion values are stored as a
machine word with the bit assignments shown in Table 2.

Bit         Assigned Value

0           sign
1-11        storage values
12-15       zero

Table 2.  Bit Assignment for Control Words


f_n (Hz)    Bandwidth (Hz)    Approximate Band Coverage (Hz)
                              f_l         f_h

 260          130              203         333
 390          130              330         460
 520          130              459         589
 650          130              588         718
 780          130              718         848
 910          140              843         983
1060          160              983        1143
1220          180             1133        1313
1400          200             1303        1503
1600          220             1494        1713
1820          250             1699        1949
2070          300             1925        2225
2370          340             2206        2546
3035         1030               --          --
  --         1445               --          --
  --         2005               --          --

Constant voltage level inputs to the ASA-16 chip are
maintained by the automatic gain control (AGC) circuit.
Designed to accept 20 mV to 20 V over a 60 dB dynamic range,
the AGC outputs voltages of 1 V to 4 V, peak to peak.

To define band energies, the ASA-16 chip uses sixteen
channels, each composed of a second-order bandpass filter,
a half-wave rectifier, and a low-pass filter. Each channel is
sequentially accessed using a sampled-and-held multiplexer.
Table 1 shows the center frequencies and bands associated
with each filter. A TTL crystal-controlled 1 MHz clock is
used for timing. Since the processor is built on a single
board, both the chip and the board are powered by a common
+10 V power supply.


Buffers are used between the chip outputs and the 4331
Eclipse A/D converter inputs to correct DC offset and
satisfy the 200 kohm termination resistance that each
bandpass filter requires (5:8-41).

Speech Processing


The  word  recognition  algorithm  developed  in  this  thesis 
is  based  on  the  output  of  the  acoustic  processor  designed  by 
Ajmal  Hussain  (5).  Built  on  a  single  board,  the  processor 
conditions  the  analog  speech  prior  to  analog/digital 
conversion.  This  chapter  discusses  the  transformation  of 
speech from an analog signal to a set of feature vectors
used  in  pattern  recognition. 

Processor 

Prior to A/D conversion, the audio signal is preconditioned
to preserve frequency components and limit signal
voltage  levels.  The  primary  component  of  this  conditioning 
process  is  the  ASA-16  spectrum  analyzer  chip.  Figure  2 
illustrates  what  happens  to  the  speech  after  entering  the 
system  via  a  dynamic  microphone  and  preamplification. 

A preemphasis filter and an active low-pass filter band
[...] levels for the automatic gain control circuit which
follows.

two components of the machine: first, the acoustic
processor (developed by Hussain), which provides a vector
representation of the speech signal, and second, the
recognition algorithm for determining a phoneme
representation for each speaker, establishing templates for
each word in the vocabulary (Appendix A), and finding words
in a continuous speech phrase. More detailed outlines of
these components are given in subsequent chapters.

Materials and Equipment

The  following  materials  and  equipment  will  be  used: 

1.  Data  General  Eclipse  S/250  Computer, 

2.  Filter  Bank  -  Acoustic  Processor. 

Sequence of Presentation

The information contained in this report is presented
in the following manner. Chapter 2 provides an explanation
of speech processing, covering both the digitization and
frequency sampling of the speech signal. Chapter 3 describes
the specifics of word recognition. Next, Chapter 4 is a
synopsis of algorithm design. Finally, recognition results
and appropriate conclusions and recommendations are
presented in Chapters 5 and 6, respectively.


front  ends  have  been  developed  using  fast  Fourier  transforms 
and  filter  banks.  Spectrograms  of  speech  waveforms  have 
also  provided  information  as  to  the  possible  waveform 
characterization  using  energy  components.  From  these 
investigations,  Seelandt  (10)  developed  a  universal  feature 
set  representing  phonemes  found  in  digitized  speech.  Then, 
Seelandt, Montgomery (7), and Hussain (5) succeeded in
designing algorithms to recognize both isolated and
connected speech by use of distance rules and template
matching.

The  purpose  of  this  thesis  is  to  combine  several 
techniques  developed  at  the  AFIT  Signal  Processing 
Laboratory  with  additional  modifications  to  produce  a  viable 
speech  recognition  system.  A  complete  system,  one  in  which 
analog speech is input and the recognized words are output,
did  not  previously  exist. 

Two  primary  considerations  in  the  design  of  this  system 
were  modularization  and  user  friendliness.  Modularity 
provides  easy  modification  to  system  components,  and  user 
friendliness  provides  an  operational  machine  which  can  be 
demonstrated  by  someone  who  may  not  be  a  computer  expert. 

General Approach

The  speech  recognition  system  developed  from  this 
research  is  described  in  Figure  1.  Basically,  there  are 


This  means  that  phonemes  at  word  boundaries  are  transitional 
and  do  not  reveal  much  word  information. 

Summary of Current Knowledge

Research  reports  show  extensive  studies  in  the  area  of 
continuous  speech  recognition  (4).  Two  methods  currently 
being  used  for  continuous  speech  recognition  are 
segmentation  and  whole-word  template  matching.  Segmentation 
is  a  means  by  which  the  acoustic  signal  is  divided  into 
unique  sound  units.  After  the  word  has  been  segmented,  an 
attempt  is  made  to  match  the  sequence  of  units  to  a 
particular vocabulary word. The problem, though, is that
some phonemes are lost in the process of chopping the signal
into time slices. The other method, whole-word template
matching, is a means of matching whole-word phoneme
representations to the strings of phonemes produced by an
acoustic processor. Problems inherent in this technique are
due to the variability of the word's duration when spoken. In
the effort to find a solution, some success has been achieved
through the characterization of a word by phonemes from the
body of the word, thus eliminating transition problems
(4:574).

The  development  of  a  speech  recognition  machine  has 
been  an  important  area  of  research  at  the  Air  Force 
Institute  of  Technology  for  several  years.  Several  acoustic 


features  such  as  formant  frequencies,  correlation 
coefficients,  or  linear  predictive  coefficients  for  pattern 
matching. Once the acoustic signal is encoded, templates
are  constructed  for  each  word.  These  templates  are 
subsequently  matched  to  encoded  speech  for  word  hypothesis. 
Although  proven  effective  for  isolated  speech,  this  method 
is  difficult  to  apply  to  continuous  speech  recognition  which 
has  undefined  word  boundaries  and  variable  word  durations. 
To  remedy  this  situation,  algorithms  incorporate  partial 
template  matching  techniques.  Thus,  the  object  of  this 
research  was  the  development  of  a  continuous  speech 
recognition  system  relying  on  a  universal  feature  set  and 
partial  template  matching. 

Scope 

The  computer  algorithm  developed  in  this  thesis  was 
designed  to  recognize  speaker  dependent,  continuous  speech. 
Some  considerations  for  acoustic  processor  error  are  made  by 
calculating  error  statistics  and  incorporating  them  into  the 
word  determination  phase  of  the  algorithm. 

Assumptions

This  research  was  based  on  two  assumptions.  First,  the 
output  of  the  acoustic  processor  is  reliable  such  that 
phoneme  choices  are  consistent  best  guesses.  Second,  words 
can  be  defined  by  an  incomplete  phoneme  representation. 


can hope to perform artificial decoding. Second, speech is a
variable signal. Each time a speaker says a particular
word, the phoneme representation of the word differs
slightly. To the human ear, the variation is almost
indistinguishable, but a machine must rely upon
mathematical exactitudes. Therefore, any discrepancies are
magnified in mechanical speech processing. Finally, although
the  brain  somehow  can  distinguish  the  separations  between 
words,  these  word  boundaries  are  difficult  for  machines  to 
find  because  the  acoustic  signal  has  no  apparent  pauses 
(4:570). 

Fast,  reliable  man-machine  communication  is  becoming  a 
necessity  as  computers  become  integrated  into  today's 
society.  Speech,  the  most  natural  form  of  human 
communication,  seems  appropriate  in  this  application  (6:64). 
Already, current testing onboard the Air Force's AFTI F-16
has proven that speech recognition is a definite aid to a
pilot taxed to his physical limits by mechanical tasks.

Problem 

The  major  problem  in  speech  recognition  lies  not  in 
the  characterization  of  the  acoustic  signal,  but  in  the 
determination  of  a  decision  scheme  which  uses  these 
characteristics  for  word  recognition.  Several  decision 
schemes  have  been  developed  which  use  acoustic  signal 


In the figure, items worthy of particular attention are
the device code number for A/D (IDS21), the number of pages
in the data channel (16 blocks x 1024 words/block = 16384
words), and the data channel starting address IBUFF (area in
common memory) (1:60-70).

To perform A/D conversion, the sequence of software
commands is

      EXTERNAL IDS21
      EXTERNAL IDS23
      COMMON/IBUFF/IDATA3(16384)
      COMMON/IBUFO/IDATAO
      DIMENSION IORBA(16)
      CALL DSTRT(IER)
      CALL DOIT[/W](IORBA,device-id,8,IDATA1,IDATA2,IDATA3,IER)

By way of explanation, the first few lines declare both the
device identification numbers and the common areas. After
that, DSTRT performs initialization, and DOIT requests the
conversion operation. Errors are reported by IER values as
shown in Table 4. Any additional error is reported by
IORBA(14). If IORBA(14) does not equal 40000K after
conversion, an external clock interrupt or a clock
overrun/underrun has occurred (1:78-83). A complete listing
of this procedure can be found in the ATOD.FR source code in
Appendix C.

Vectors are sequentially ordered in the data array,
IDATA3. Each vector represents a time slice in which all
sixteen filters have been sampled. The sampling rate per
vector is 25 Hz, or once every 40 msec.

As described in the introduction, word recognition in
continuous  speech  consists  of  making  a  phoneme  template, 
constructing  word  templates  from  the  phoneme  template,  and 
matching  the  word  templates  to  an  input  phoneme  string. 
This  chapter  discusses  the  process  in  more  detail. 

Since as early as 1947, phonemes have been used in word
recognition. Potter, Kopp, and Green (8) found that sounds
had unique spectrograms, and that people whom they had
trained to visually read these spectrograms could identify
sounds in a word spectrogram. In other words, distinct
phoneme patterns were discernible in word spectrograms. In
1981, Seelandt (10) investigated digitized speech
spectrograms, and the concept of phoneme patterns emerged.
By combining several time slices or vectors of digitized
speech, he produced a set of seventy phonemes. The concept
was carried a step further by Hussain (5), who produced
single-vector sixteen-dimensional phonemes. The method for
phoneme generation developed for this system compares and
averages these sixteen-dimensional vectors to produce a set
of fewer than seventy phonemes. The comparison and averaging
continue until a phoneme set is defined. This technique is
explained more clearly later in this chapter.

Distance Rules


Since  they  are  suited  to  defining  the  spatial 
orientation  of  n-dimensional  vectors,  distance  rules  are  key 
to  vector  metrics.  An  understanding  of  vector  relationships 
allows  one  to  reduce  large  vector  sets  and  approximate 
vector  strings  by  known  vectors. 

Many  pattern  recognition  algorithms  incorporate  metrics 
based  upon  Minkowski  distances.  Table  5  presents  some  of 
these  different  calculations. 

The  Minkowski  distance  between  two  vectors  is  computed 
by  the  following  procedure.  The  absolute  difference  between 
corresponding  vector  elements  is  calculated  and  then  raised 
to  the  Minkowski  power.  Next,  these  values  are  summed  and 
the  resulting  sum  raised  to  (1/Minkowski  power).  The  final 
value  is  the  vector  score,  or  the  numerical  relationship 
between the two vectors. This algorithm uses Minkowski 4,
which emphasizes the effect of any large discrepancies
between compared vectors (5).
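
The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis code; the function name and the default power of 4 (matching the "Minkowski 4" choice above) are assumptions made for the example.

```python
# Minimal sketch of the Minkowski distance (illustrative, not the thesis code).
def minkowski_distance(a, b, power=4):
    """Return the Minkowski distance of the given power between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Absolute element differences raised to the Minkowski power, then summed.
    total = sum(abs(x - y) ** power for x, y in zip(a, b))
    # The sum is raised to (1 / power) to give the vector score.
    return total ** (1.0 / power)
```

With `power=1` this is the city-block distance and with `power=2` the Euclidean distance; larger powers let the largest element difference dominate the score.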
Rabiner et al. (9) showed that speech vectors in space
tend to cluster. Using this concept, his research group
tried clustering analysis on phonemes, words, and speakers.
The experiments were successful for isolated word
recognition, but clustering became a computational nightmare
in continuous speech recognition. As a result, clustering was
not emphasized in this research. However, some spatial
comparisons are performed.

By visualizing the speech vectors as residing in
n-dimensional space, one can imagine that some vectors lie
closer  together  than  others.  Therefore,  close  vectors  can 
be  used  to  define  regions  of  space,  or  clusters.  As 
previously  mentioned  this  algorithm  redefines  nearest 
vectors  by  averaging  and  weighting.  Each  vector  has  an 
initial  weight  of  one,  and  those  vectors  which  are  closest 
to one another by Minkowski distance are averaged together
by the equation

$$X_m = \frac{X_m \cdot \mathrm{weight}(m) + X_n \cdot \mathrm{weight}(n)}{\mathrm{weight}(m) + \mathrm{weight}(n)} \qquad (3)$$

where m is the lower-numbered and n the higher-numbered
vector of the pair. The highest-numbered vector is deleted
and a new weight is assigned to the new vector by

$$\mathrm{weight}(m) = \mathrm{weight}(m) + \mathrm{weight}(n) \qquad (4)$$

The resulting set of vectors represents clusters of similar
sounds.
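
The weighted averaging step can be illustrated with the short Python sketch below. The function name and argument layout are assumptions for the example; the arithmetic follows the weighted average and weight update described above.

```python
# Sketch of weighted vector averaging (hypothetical helper, not the thesis code).
def merge_vectors(x_m, weight_m, x_n, weight_n):
    """Merge two vectors by weighted average; return (new vector, new weight)."""
    total = weight_m + weight_n
    merged = [(a * weight_m + b * weight_n) / total for a, b in zip(x_m, x_n)]
    return merged, total
```

Because the new weight is the sum of the two old weights, a vector that already stands for several merged vectors pulls the average toward itself in later merges.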

Templates

By incorporating the above-mentioned concepts, a
phoneme template is made for each speaker. This template, a
set of energy-normalized vectors, represents the most
distinctive sounds produced by that speaker. The process is:

1.  Vector  normalization 

2.  Vector  deletion 

3.  Vector  comparison 

4.  Vector  averaging. 

Vector normalization consists of noise removal and
energy normalization. Since background noise is present in
all environments except an anechoic chamber, its removal
from the audio signal allows for better characterization of
the signal. The first vector in each speech file is
considered to represent the average background noise.
Therefore, by subtracting this vector from each of the
remaining vectors, the noise is removed. To ensure voltage
consistency, a noise threshold voltage of 30.3 millivolts
(corresponding to the average laboratory background noise)
is set. Each element in a vector is checked to ensure
voltages above the noise threshold. If the value of the
element voltage falls below this threshold, the element's
voltage is redefined as zero. Additionally, the vectors are
energy normalized by dividing each vector component by the
vector's energy, calculated by

These energy-normalized vectors are now considered phonemes.
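
The normalization steps can be sketched in Python as follows. The function names are hypothetical, and because the energy equation is not reproduced in this text, the energy measure is passed in by the caller instead of being assumed; only the 30.3 mV threshold and the use of the first vector as the noise estimate come from the description above.

```python
# Sketch of noise removal and energy normalization (illustrative names only).
NOISE_THRESHOLD_V = 0.0303  # 30.3 millivolts, as stated in the text


def remove_noise(vectors, threshold=NOISE_THRESHOLD_V):
    """Subtract the first (noise) vector and zero elements below threshold."""
    noise = vectors[0]
    cleaned = []
    for vec in vectors[1:]:
        diff = [v - n for v, n in zip(vec, noise)]
        cleaned.append([d if d >= threshold else 0.0 for d in diff])
    return cleaned


def energy_normalize(vector, energy_fn):
    """Divide each component by the vector's energy.

    energy_fn is supplied by the caller; the thesis's own energy
    definition is not shown here, so none is assumed.
    """
    energy = energy_fn(vector)
    if energy == 0:
        return list(vector)  # leave all-zero vectors unchanged
    return [v / energy for v in vector]
```

For example, `energy_normalize([1.0, 3.0], sum)` uses the plain sum of components as a stand-in energy measure and returns `[0.25, 0.75]`.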

On the basis of energy thresholding, some vectors are
omitted from the speech file. The energy threshold is
determined by finding the average energy per vector over the
test utterance. The equation is

$$E_{\mathrm{threshold}} = \frac{\sum_{i} E_i}{\text{number of vectors}}$$
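
A minimal Python sketch of the energy thresholding is given below. The function name is an assumption, and the per-vector energies are taken as already computed, since the energy definition itself is not shown in this text.

```python
# Sketch of average-energy thresholding (hypothetical helper name).
def drop_low_energy(vectors, energies):
    """Keep vectors whose energy is not below the average energy."""
    threshold = sum(energies) / len(energies)
    kept = [v for v, e in zip(vectors, energies) if e >= threshold]
    return kept, threshold
```

Using the utterance's own average as the threshold means the cut adapts to how loudly the speaker talks, at the cost of always discarding a sizeable share of the vectors.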


Once normalization and deletion have been performed,
the phonemes are compared to one another by Minkowski 4
distance to determine each vector's nearest neighbor. Then,
if the total number of vectors is greater than sixty-nine,
the two nearest neighbors are averaged together using the
averaging routine described earlier (see Clustering)
and replaced by a single vector, which is then energy
normalized. After all nearest-neighbor averaging has been
completed, a check is made to see if the number of remaining
vectors is less than seventy (the maximum number of phonemes
that the system can handle). If seventy or more remain, the
entire comparison/averaging process is repeated.

The final outcome of this procedure is a set of
vectors, each of which represents a particular phoneme.
This vector set is referred to hereafter as the phoneme
template. Each phoneme is then assigned a number from one
to at most sixty-nine.


Word 

After the phoneme set is established, it is used to
develop a codebook of phoneme representations for each word
in the given vocabulary (Appendix A). The process
components are:

1. Normalization

2. Phoneme extraction

3. Word representation by phonemes

4. Vocabulary creation

As is the case in the creation of phoneme templates, the
vectors in the digitized speech of each word are normalized.

At  this  point  phoneme  extraction  takes  place.  Each 
vector  in  the  phoneme  template  is  compared  to  each  vector  in 
the  vector  string  of  the  input  speech.  By  calculating  the 
Minkowski 4 distance between a template vector and a string
vector, the closest (minimum distance) template vector is
determined. Then the phoneme number associated with the
template vector is placed in an array, and the process is
repeated until all string vectors are represented by phoneme
numbers.
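
Phoneme extraction amounts to a nearest-template search, which might be sketched in Python as below. The function names are illustrative assumptions; phonemes are numbered from one, as in the phoneme template.

```python
# Sketch of phoneme extraction by nearest template vector (illustrative).
def minkowski4(a, b):
    """Minkowski distance of power 4 between two equal-length vectors."""
    return sum(abs(x - y) ** 4 for x, y in zip(a, b)) ** 0.25


def extract_phonemes(template, string_vectors):
    """Replace each input vector by the number of its closest template vector."""
    phonemes = []
    for vec in string_vectors:
        best = min(range(len(template)),
                   key=lambda i: minkowski4(template[i], vec))
        phonemes.append(best + 1)  # template phonemes are numbered from one
    return phonemes
```

The search costs one distance calculation per template vector for every input vector, so a 69-phoneme template and a 100-vector utterance need 6900 distance evaluations.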

Following  the  process  of  string  representation,  the 
array  of  phonemes  is  compressed  to  at  most  ten  phonemes  in 
the  following  manner.  First,  identical  adjacent  phonemes 
are  represented  as  one  phoneme.  Then,  the  distances  between 
adjacent  phonemes  are  obtained  from  a  distance  matrix. 
Adjacent  phonemes  with  distances  between  them  of  less  than 
11.0  (distance  measures  are  normalized  to  100)  are 
compressed  by  deleting  the  highest  numbered  phonemes. 
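
The two compression steps can be sketched in Python as follows. This is only one possible reading: the single left-to-right pass, the function name, and the nested-dictionary distance matrix are assumptions, and the sketch does not enforce the ten-phoneme limit because the text does not say how that limit is applied.

```python
# Sketch of phoneme-string compression (one possible reading, not the thesis code).
def compress(phonemes, distance, threshold=11.0):
    """Collapse repeats, then merge adjacent phonemes closer than threshold.

    distance[p][q] is the (normalized) distance between phonemes p and q.
    """
    # Step 1: identical adjacent phonemes are represented as one phoneme.
    collapsed = []
    for p in phonemes:
        if not collapsed or collapsed[-1] != p:
            collapsed.append(p)
    # Step 2: for close adjacent phonemes, delete the higher-numbered one.
    result = []
    for p in collapsed:
        if result and distance[result[-1]][p] < threshold:
            result[-1] = min(result[-1], p)
        else:
            result.append(p)
    return result
```

With distances of 5 between phonemes 1 and 2, 50 between 1 and 3, and 40 between 2 and 3, the string `[1, 1, 2, 3, 3, 1]` compresses to `[1, 3, 1]`.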

The  final  outcome  of  this  process  is  a  string  of 
phonemes  representative  of  a  vocabulary  word.  Collectively, 
these  strings  represent  the  entire  vocabulary.  Table  6 
shows the phoneme representations of several words.


Table 6.

Word  Spotting 

Word spotting [...] continuous speech, [...] templates,
speech [...] following procedure:

1. Normalization

2. Phoneme extraction

3. Compression

4. Comparison.

The procedure for normalization, extraction, and
compression of input speech is identical to that which was
followed in the creation of word templates. Once the input
speech is represented as a string of template phonemes,
recognition is attempted.

The recognition scheme, based on the scheme described
by Hussain (5), is as follows. After having undergone
phoneme representation, the input string of phonemes is
searched for two or more adjacent zeros. Since zero
phonemes represent low-energy noise phonemes, adjacent zero
phonemes are considered word boundaries. When two word
boundaries are found, the next step is to establish the
number of phonemes between the boundaries. Fewer than three
phonemes are considered noise; more than eight phonemes
means that two words have been spoken.
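
The boundary search can be illustrated with the Python sketch below. The function names and the labels are assumptions for the example, and a lone zero inside a segment is simply kept as part of that segment, which the text does not specify.

```python
# Sketch of word-boundary detection in a phoneme string (illustrative names).
def find_boundaries(phonemes):
    """Return (start, end) index pairs for runs of two or more zeros."""
    runs, i, n = [], 0, len(phonemes)
    while i < n:
        if phonemes[i] == 0:
            j = i
            while j < n and phonemes[j] == 0:
                j += 1
            if j - i >= 2:
                runs.append((i, j))
            i = j
        else:
            i += 1
    return runs


def classify_segments(phonemes):
    """Label the phonemes lying between each pair of successive boundaries."""
    runs = find_boundaries(phonemes)
    labelled = []
    for (_, left_end), (right_start, _) in zip(runs, runs[1:]):
        segment = phonemes[left_end:right_start]
        if len(segment) < 3:
            label = "noise"
        elif len(segment) > 8:
            label = "two words"
        else:
            label = "one word"
        labelled.append((segment, label))
    return labelled
```

Only phonemes enclosed by two boundaries are classified, so speech before the first boundary or after the last one is ignored in this sketch.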

One-word identification is performed by finding the
distance (Minkowski 4) between the first word template

Distance Matrix Creation

vectors could have the same nearest neighbor, it is
necessary  to  find  pairs  of  vectors  which  are  each  other's 
nearest  neighbors.  This  requires  a  considerable  amount  of 
time,  but  it  is  important  for  accurate  analysis  of  vector 
metrics.  After  a  nearest  neighbor  pair  is  found,  the  vector 
elements  of  the  lowest-numbered  vector  are  recalculated 
using  Equation  3  and  energy  normalized.  The  weight  of  each 
vector  is  recalculated  by  Equation  4,  and  the  weight  of  the 
highest-numbered  vector  is  set  to  zero.  The  process  is 
repeated  until  all  nearest  neighbor  pairs  have  been  found. 
These  vectors  are  then  compressed  by  the  subroutine  CMPRS1, 
and  the  number  of  remaining  vectors  is  examined.  If  this 
number  is  greater  than  sixty-nine,  new  nearest  neighbors  are 
determined  and  the  averaging  process  is  repeated.  This 
entire process can take anywhere from five to ten
minutes.
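
The search for pairs of vectors that are each other's nearest neighbors can be sketched in Python as follows. This is an illustrative brute-force version with hypothetical function names, not the COMPARE subroutine itself, and it assumes at least two vectors are present.

```python
# Sketch of mutual nearest-neighbor pairing (brute force, illustrative only).
def minkowski4(a, b):
    """Minkowski distance of power 4 between two equal-length vectors."""
    return sum(abs(x - y) ** 4 for x, y in zip(a, b)) ** 0.25


def nearest_neighbors(vectors):
    """For each vector, return the index of its closest other vector."""
    indices = range(len(vectors))
    return [min((j for j in indices if j != i),
                key=lambda j: minkowski4(vectors[i], vectors[j]))
            for i in indices]


def mutual_pairs(vectors):
    """Return (lower, higher) index pairs that are each other's nearest neighbor."""
    nn = nearest_neighbors(vectors)
    return [(i, j) for i, j in enumerate(nn) if i < j and nn[j] == i]
```

Every vector is compared with every other vector, so the work grows with the square of the number of vectors, which is consistent with the long running time noted above.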

After the correct number of vectors has been obtained,
the voltage values are stored in the array PHON and passed
to TRAIN, which has options for writing them to a file
(PHONE) and reading them from a file (PHONE). (Figure 5
depicts the voltage elements for each vector in the phoneme
template file. The vectors are horizontally arranged such
that phoneme 1 is positioned (1,1), (1,2),

Figure 4.  Energy in Data Vectors

In the COMPARE subroutine, the distances between all
pairs of vectors are determined by Minkowski distance
calculations. Each vector's nearest neighbor is determined
by Hussain's algorithm. This concept is carried a step
further by the introduction of vector weighting. Since many


into the array. Then the average energy of all the vectors
(500 vectors = 10 seconds of speech) is calculated for
future use as a threshold. Several alternative energy
thresholds were considered before the average energy was
judged an optimum threshold.

After energy normalizing each vector (by dividing each
vector element by the entire vector's energy), those
vectors with energy below the threshold energy are deleted.
This is accomplished in the subroutine LOWENERGY, which also
assigns a weight of one to each vector prior to deletion.
Deleted vectors are then weighted zero. (This weighting
system dramatically reduces the number of calculations
previously used by Hussain to delete vectors.) Vectors
with weights of zero are bubble sorted to the end of the
data array, and the number of sixteen-dimensional vectors
left is calculated and stored in position IDATA(IDATA2+1)
(see CMPRS1.FR, Appendix C). If this number is less than
seventy, the program returns to TRAIN. If not, it calls the
subroutine COMPARE, which is a comparison and averaging
routine.


program can digitize. This is especially important to the
various phases of the speech recognition system. A SWAP
calls each program for digitizing ten seconds of speech for
phoneme template creation (ATOD10), two seconds of speech
for word phoneme representations (ATOD2), and five seconds of
speech for speech recognition (ATOD5). Each program creates
a file (OUT10, OUT2, and OUT5, respectively) in which data
conversion values or integer voltages are stored. The data
are stored sequentially, such that every set of sixteen
integer values represents a vector of speech.
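
Reading such a file back into vectors is a matter of slicing the sequential values into groups of sixteen, as in this Python sketch. The function name is an assumption, and any incomplete trailing group is dropped, a choice the text does not address.

```python
# Sketch of regrouping sequential conversion values into vectors (illustrative).
def to_vectors(values, channels=16):
    """Split a flat list of conversion values into vectors of `channels` values."""
    usable = len(values) - len(values) % channels  # ignore a partial last group
    return [values[i:i + channels] for i in range(0, usable, channels)]
```

For instance, 8000 stored values regroup into 500 sixteen-element vectors.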

An  important  requirement  for  the  correct  processing  of 
speech  is  that  the  input  speech  be  between  +5  and  -5  volts. 
This level should be checked on the oscilloscope.

Template Creation

A prompt from TRAIN requires the user to input ten
seconds of speech, from which the set of phonemes will be
derived. When the user says a predetermined phrase into a
microphone, ATOD10 produces the file OUT10 containing 500
sixteen-dimensional vectors. A call is then made to the
subroutine TEMPLATE, which directs the creation of the
phoneme template. OUT10 is opened and read into a common
buffer in which the sampled data points are stored in the
array IDATA. Then the energy in each vector is calculated
and stored in the array TENERGY. (Figure 4 is a printout of
this array.) Each vector's energy is entered sequentially


for use in the system described herein required a
considerable  amount  of  modifications.  Many  of  Hussain’s 
individual  programs  appear  as  subroutines  in  the  current 
program,  making  it  necessary  to  provide  strict  continuity 
among  them,  and  to  develop  interface  routines  so  they  can 
become  efficient  contributors  to  a  unified  whole. 
Furthermore,  Hussain's  programs  require  extensive 
interaction  between  the  user  and  the  machine,  often  with  the 
human  performing  tasks  which  could  be  better  handled 
automatically  (e.g.  the  creation  of  phoneme  templates  which 
involves  manually  removing  undesirable  phonemes).  Hussain's 
results  could  not  be  reproduced  by  other  experimenters 
because  his  choice  of  values  for  some  constants  used  in  the 
algorithms  was  often  completely  arbitrary  (at  least  the 
method  for  determining  them  was  not  revealed).  In  the 
interest of a more scientific approach, an attempt has been
made  in  this  thesis  to  derive  values  for  these  constants 
(energy  threshold,  distance  threshold,  error  constants)  from 
a  more  logical  basis. 

A/D Conversion

After the analog speech has gone through the
preprocessor, it is converted to a digital signal by one of
three programs: ATOD2, ATOD5, and ATOD10. The numbers in
the program titles indicate how many seconds of speech each

From the very start, this research effort was hampered
by  time  constraints.  The  thesis  report  which  served  as  a 
starting  point  for  this  particular  investigation  was 
incomplete  and  contained  vague  documentation  making  it 
necessary  to  do  some  preliminary  analysis  in  order  to  define 
what  had  already  been  accomplished.  This  coupled  with  the 
sheer  magnitude  of  the  task  of  developing  a  complete, 
operational  speech  recognition  system  where  none  had  existed 
previously,  resulted  in  the  expiration  of  available  time 
before  a  thorough  quantitative  evaluation  of  the  system's 
performance  could  be  made.  Such  an  evaluation  would  require 
the  compilation  of  large  amounts  of  data  and  systematic 
tuning  of  parameters  in  search  of  the  optimal  setting  for 
the  control  dials.  The  fine  tuning,  although  important,  is 
not  really  essential  to  the  validation  of  this  system  as 
"operational",  and  so  the  discussion  in  this  chapter  will 
focus  on  the  performance  of  the  different  components  and  the 
way  they  are  integrated  into  the  system. 

The  programs  developed  by  Ajmal  Hussain  (5)  provided 
the  foundation  for  most  of  this  research  effort.  (The  major 
thrust  of  the  present  treatment  is  to  take  the  various 
individual  parts  of  the  puzzle  provided  by  Hussain (5)  and 
others  (10)  (7)  and  combine  them  into  a  complete  speech 
recognition system.) However, the adaptation of his programs

closeness value until there are at most ten phonemes left
in the string. [An aside: Low energy vectors are considered
noise and are represented as zero. Two adjacent zeroes are
compressed to one zero, and one zero by itself is ignored,
as are all zeroes appearing at the beginning of the string
(5).] The array SOUND, containing these phonemes, is then
entered into the appropriate location in the matrix VOCAB.
After all of the vocabulary words have been processed,
PRINTREP prints the words along with their phoneme
representations.
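
The zero-handling rules in the aside can be sketched in Python
(for illustration only; the function name is an assumption). The
logic follows the two tests visible at the end of the CMPRS
fragment in Appendix C: a zero is skipped while nothing has been
output yet, and a zero is skipped when the preceding entry is
non-zero.

```python
def squeeze_zeros(string):
    """Apply the low-energy (zero) rules to a phoneme number string."""
    out = []
    for i, phoneme in enumerate(string):
        if phoneme == 0 and not out:
            continue  # zeroes at the beginning of the string are ignored
        if phoneme == 0 and string[i - 1] != 0:
            continue  # first zero of a run: a lone zero vanishes, a pair becomes one
        out.append(phoneme)
    return out


squeezed = squeeze_zeros([0, 0, 5, 0, 7, 0, 0, 9])
```

In this example the leading zeroes and the lone zero after 5
disappear, and the adjacent pair before 9 collapses to a single
zero.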


A call to PRINTDIS provides a printout of the distances
between  phonemes,  normalized  to  100. 
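
A minimal Python sketch of such a packed distance table follows
(illustration only, not the DISTANCE routine itself). The index
expression m(m-1)/2 + n is the one used to address DIS in the
CMPRS listing, and the fourth-power sum mirrors the inner loop of
COMPARE. Whether DISTANCE takes a fourth root is not shown in
this section, so the sketch omits it; scaling so that the largest
entry equals 100 is likewise an assumed reading of "normalized to
100", and all function names are hypothetical.

```python
def minkowski4(a, b):
    """Sum of fourth powers of component differences (no root taken)."""
    return sum((x - y) ** 4 for x, y in zip(a, b))


def tri_index(m, n):
    """1-based slot of phoneme pair (m, n) in the packed lower triangle."""
    if m < n:
        m, n = n, m
    return m * (m - 1) // 2 + n


def distance_table(phonemes):
    """Build the packed table and scale it so the largest distance is 100."""
    count = len(phonemes)
    table = [0.0] * (tri_index(count, count) + 1)  # slot 0 unused (1-based)
    for m in range(2, count + 1):
        for n in range(1, m):
            table[tri_index(m, n)] = minkowski4(phonemes[m - 1], phonemes[n - 1])
    largest = max(table) or 1.0  # avoid dividing by zero for identical vectors
    return [100.0 * d / largest for d in table]


table = distance_table([[0, 0], [1, 0], [0, 2]])
```

Storing only the lower triangle roughly halves the storage
relative to a full square matrix, since the distance from m to n
equals the distance from n to m.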

After  these  two  steps  are  completed,  TRAIN  determines 
phoneme  representations  for  the  words  in  the  vocabulary. 
Vocabulary  words  are  flashed  onto  the  video  screen,  one  at  a 
time. A swap to ATOD2 enables the A/D conversion of two
seconds of speech. Conversion values are subsequently stored
in the file OUT2.

Next,  REP  is  called  to  read  the  eight  blocks  (2048 
conversion  values)  into  a  buffer.  As  in  TEMP,  every  16 
values  represents  a  vector  of  sampled  speech.  After 
normalization  (NORMALIZE),  phonemes  are  extracted  (EXTRACT) 
from  the  speech  file.  This  means  that  the  speech  file  is 
represented  as  a  string  of  250  phonemes.  CMPRS  goes  through 
the  phoneme  string  comparing  each  pair  of  adjacent  phonemes. 
If the two are identical, they are compressed into one; if
they are dissimilar, the distance between them is examined,
and in the case of a distance within the threshold, the
phoneme having the higher number is deleted from the string.
If the distance lies outside the threshold, no action is
taken  and  the  next  pair  of  phonemes  is  examined.  CMPRS 
continues  reducing  the  phoneme  string  by  increasing  the 


Creating a phoneme template requires digitized speech
input and numerous calculations and comparisons. TRAIN
requests that the user provide 10 seconds of input speech. A
swap with TRAIN calls ATOD10, which performs the A/D
conversion and stores the data conversion values into a file
named OUT10. Then TEMPLATE reads 16 blocks (8192 conversion
values) from the file into a buffer. Every sixteen
conversion values represents a sixteen-dimensional vector of
speech. The vectors are normalized (NORMALIZE), checked for
low energy (LOWENERGY), and compressed (CMPRS1) by removing
the low energy vectors. A comparison and averaging routine
called COMPARE finds each vector's nearest neighbor vector.
Subsequently, the two nearest neighbors are averaged and,
following compression (CMPRS1), a new set of vectors is
defined. This process repeats until a maximum of sixty-nine
vectors is left. (Seelandt (10) showed 70 phonemes was a
viable number for characterizing sounds in speech.) These
remaining vectors are stored in the array PHON for future
use. The value PHON(1121) is the total number of vector
phonemes.
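
The averaging loop can be sketched in Python as follows. This is
a deliberately simplified illustration, not the COMPARE routine:
it merges only the single closest pair on each pass, whereas the
listing in Appendix C averages several mutual nearest neighbors
per call. The function names are assumptions, and integer
division is used because the thesis data are stored as integers.

```python
def dist4(a, b):
    """Fourth-power distance between two vectors, as in the COMPARE listing."""
    return sum((x - y) ** 4 for x, y in zip(a, b))


def merge_until(vectors, max_count):
    """Average the closest pair repeatedly until max_count vectors remain."""
    vectors = [list(v) for v in vectors]
    while len(vectors) > max(max_count, 1):
        best = None
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                d = dist4(vectors[i], vectors[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # The averaged vector replaces the lower-numbered one of the pair.
        vectors[i] = [(a + b) // 2 for a, b in zip(vectors[i], vectors[j])]
        del vectors[j]
    return vectors


template = merge_until([[0, 0], [2, 0], [10, 10], [11, 10]], max_count=2)
```

In the thesis setting max_count would be 69; the toy call above
reduces four two-dimensional vectors to two.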

Next,  TRAIN  calls  DISTANCE,  which  creates  a  lower 
triangular  matrix  of  the  distances  between  phonemes.  The 
matrix is stored in array DIS, in which the location between
two  phonemes  m  and  n  (where  m>n)  can  be  found  by  the 


The  computer  algorithms  designed  for  use  in  this  thesis 
were  designed  modularly  on  a  hierarchical  basis.  The 
motivation  for  this  was  the  requirement  for  ease  in  system 
modification.  The  various  components  can  be  changed  to 
accommodate  various  methods  of  phoneme  template  creation, 
word  representation,  and  continuous  speech  recognition. 
Therefore,  this  thesis  provides  a  valuable  tool  for  future 
research  in  this  area. 

The primary driving program of the system is TRAIN.
Having two phases (training and recognition), it ties all
of the components of a speaker-dependent recognition system
together. This chapter provides a discussion of these
components.

[NOTE: In addition to the main speech recognition
algorithms, several support algorithms were written for A/D
conversion, file manipulation, and creation of a vocabulary
of words. The source codes for these algorithms are in
Appendix C.]

Training

Training  is  accomplished  through  subroutine  calls  from 
the  program  TRAIN.  In  other  words,  TRAIN  has  a  training  mode 
in  which  a  phoneme  template,  a  distance  matrix,  and  a  word 
phoneme  template  are  made.  Each  mode  is  a  separate 


word  recognition  is  then  performed  as  described  above. 
Finally,  error  statistics  are  calculated  for  the  entire 
string.  For  continuous  recognition  the  entire  process  is 
repeated  by  choosing  the  second  best  guess  for  the  first 
word, then the third best guess and so on. Following the
final  iteration,  the  string  having  the  least  total  error  is 
accepted  as  the  correctly  recognized  sequence  of  words.  The 
entire  process  repeats  for  the  next  phrase  indicated  by  a 
set  of  adjacent  zeroes. 

Time Alignment

Because  phonemes  are  compared  on  a  one-to-one  basis 
after  compression,  word  duration  is  ignored  and  time 
alignment  is  unnecessary. 

Once the phoneme template and the distance matrix are
formed, the system can start making word phoneme
representations. TRAIN reads from a file the list of
vocabulary words and stores them as Hollerith strings in the
array named (appropriately) words. The vocabulary used in
this thesis is that shown in Appendix A. TRAIN's next step
is  to  loop  through  the  vocabulary  words  to  find  the  phoneme 
representation  of  each. 

The  user  is  prompted  to  enter  two  seconds  of  speech  by 
saying  the  word  which  appears  on  the  computer  video  screen 
during the allotted time window. A swap to ATOD2 produces
the file OUT2 containing the 125 digitized speech vectors.

TRAIN then calls the subroutine REP, which controls the
procedure for finding that word's phoneme representation.
As  was  the  case  in  the  creation  of  phoneme  templates,  the 
data  vectors  are  normalized  and  the  low  energy  vectors  are 
deleted  using  the  previously  determined  threshold.  Each 
data  vector  is  then  compared  to  the  complete  set  of  phoneme 
vectors  by  calculating  the  distance  between  the  data  vector 
and  each  of  the  phoneme  vectors.  After  this  the  phoneme 
vector  whose  distance  calculation  produced  the  smallest 
value  is  used  to  represent  the  string  vector.  This  is 
repeated  for  the  entire  string  of  speech  vectors,  forming  an 
array  of  numbers  representing  the  closest  phoneme  vectors. 
This string is then compressed (CMPRS) by replacing
identical phonemes with a single phoneme and deleting the
highest-numbered  adjacent  phonemes  whose  distances  are  below 
the  threshold  distance,  which  is  10.  A  check  is  made  to  see 
if  the  resulting  string  has  a  maximum  of  ten  phonemes.  If 
not,  the  distance  threshold  is  increased  by  0.5  and  the 
string  is  compressed  again.  (Note:  the  choice  of  10  as  a 
distance  threshold  is  arbitrary  as  is  the  choice  of  a 
maximum  of  ten  phonemes  for  word  representation.  A 
maximizing  routine  could  probably  be  incorporated  here  to 
more  accurately  determine  the  values  for  these  limits). 
Figure  7  shows  the  string  of  numbers  representing  the  input 
speech  vectors,  before  and  after  compression. 
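
The labeling and compression steps can be sketched in Python
(illustration only, not the REP, EXTRACT, or CMPRS routines). The
starting threshold of 10, the increment of 0.5, and the limit of
ten phonemes come from the text; the function names and the dist
callback are assumptions, and the sketch assumes the low-energy
zeroes are handled separately, so it simply drops them.

```python
def nearest_phoneme(vector, phonemes):
    """Return the 1-based number of the phoneme vector closest to vector."""
    dists = [sum((x - y) ** 4 for x, y in zip(vector, p)) for p in phonemes]
    return dists.index(min(dists)) + 1


def compress_once(string, dist, threshold):
    """One pass: close neighbors take the lower number, then repeats collapse."""
    s = [p for p in string if p != 0]  # zeroes are assumed handled elsewhere
    for i in range(len(s) - 1):
        a, b = s[i], s[i + 1]
        if a != b and dist(a, b) < threshold:
            s[i] = s[i + 1] = min(a, b)  # the higher-numbered phoneme is deleted
    out = []
    for p in s:
        if not out or out[-1] != p:
            out.append(p)
    return out


def compress(string, dist, threshold=10.0, step=0.5, max_len=10):
    """Repeat compression with a growing threshold until the string is short."""
    out = compress_once(string, dist, threshold)
    while len(out) > max(max_len, 1):
        threshold += step
        out = compress_once(out, dist, threshold)
    return out


# Hypothetical distance between phoneme numbers, for demonstration only.
toy_dist = lambda a, b: abs(a - b) * 4.0
short = compress([5, 5, 6, 6, 20, 20, 21, 40], toy_dist)
```

In the thesis the distance would be looked up in the DIS table
rather than computed from the phoneme numbers as in this toy
example.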

Figure 7. Phoneme String Before and After Compression

After the phoneme representations for all the
vocabulary words have been found, they are placed in the
array VOCAB. TRAIN then writes them to the file VOCABUL,
and reads them from the same. TRAIN also prints the
vocabulary words, along with their phoneme representations,
on the line printer. Figure 8 shows an example of this
printout.

Figure 8. Word Phoneme Representations from TRAIN

TRAIN enters the recognition mode after the training
session is completed, and prompts the user for a five second
input of speech. Once again a SWAP call is used for A/D
conversion and a file (OUT5) is written to contain sampled
data points. The governing subroutine SPEECH then calls the
necessary routines to read the data into the common buffer,
normalize the data vectors, compress the phoneme number
string, and finally, perform speech recognition. All
the steps prior to recognition are the same as those in word
phoneme representation, with the exception of compression.
Here, compression occurs only once, as there is no minimum
number of phonemes required. The system is now ready to
enter the recognition mode.

The method for recognition has already been discussed
(see Word Recognition). Figure 9 shows a string of
phoneme numbers representing five seconds of input speech.
The second string represents the same speech vectors after
compression has been performed. The zeroes represent points
of low energy. If a zero appears alone, it is ignored in the
recognition phase. If two are side-by-side, the recognition
routine recognizes it as a word boundary, and starts
counting the number of subsequent phonemes. If fewer than
two phonemes are counted, the routine decides that it has
made a mistake and starts counting again at the next
boundary. If it counts more than nine adjacent phonemes, it
assumes that more than one word is present, so it goes into
the continuous recognition mode.
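
The boundary rules above can be sketched in Python (illustration
only, not the RECOG routine). The limits of two and nine phonemes
come from the text; the function name, the mode labels, and the
handling of the end of the string are assumptions.

```python
def split_words(string, min_len=2, max_single=9):
    """Split a phoneme string at double zeroes and classify each segment."""
    segments, current, i = [], [], 0
    n = len(string)
    while i < n:
        if string[i] != 0:
            current.append(string[i])
            i += 1
        elif i + 1 < n and string[i + 1] == 0:
            # Two zeroes side by side mark a word boundary.
            if current:
                segments.append(current)
            current = []
            while i < n and string[i] == 0:
                i += 1
        else:
            i += 1  # a zero appearing alone is ignored
    if current:
        segments.append(current)

    result = []
    for seg in segments:
        if len(seg) < min_len:
            continue  # too short: treated as a false start
        mode = "isolated" if len(seg) <= max_single else "continuous"
        result.append((mode, seg))
    return result


words = split_words([0, 0, 4, 7, 0, 9, 0, 0, 3, 0, 0] + list(range(1, 12)))
```

Here the segment 4 7 9 is treated as one word, the single
phoneme 3 is discarded, and the run of eleven phonemes is handed
to the continuous mode.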


Figure  9.  Phoneme  String  of  Five  Seconds  of  Speech, 
Before  and  After  Compression 


To perform word recognition, phonemes are compared to
string numbers and the best fitting vocabulary words are
determined, based on the distances between phonemes. The
algorithm's  best  guesses  are  printed  on  the  video  screen  and 
line  printer  attached  to  the  Eclipse. 
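
One plausible reading of this matching step is sketched below in
Python. It is an assumption-laden illustration rather than the
FINDWORD or RECOG routine: the position-by-position sum, the
fixed penalty for unmatched positions, the dictionary layout of
the vocabulary, and every name in the block are hypothetical, and
the error statistics actually used by the system are not
reproduced here.

```python
def word_error(observed, template, dist, penalty=100.0):
    """Sum phoneme distances position by position; penalize length mismatch."""
    total = 0.0
    for k in range(max(len(observed), len(template))):
        if k < len(observed) and k < len(template):
            total += dist(observed[k], template[k])
        else:
            total += penalty  # assumed cost for a missing or extra phoneme
    return total


def best_guesses(observed, vocab, dist, top=3):
    """Return vocabulary words ordered from best to worst fit."""
    ranked = sorted(vocab, key=lambda word: word_error(observed, vocab[word], dist))
    return ranked[:top]


# Hypothetical templates and distance, for demonstration only.
toy_vocab = {"ONE": [4, 7, 9], "EIGHT": [4, 8], "AFT": [30, 31, 32]}
toy_dist = lambda a, b: float(abs(a - b))
guesses = best_guesses([4, 7, 9], toy_vocab, toy_dist)
```

Returning a ranked list rather than a single word matches the
way the continuous mode is described as retrying with the second
and third best guesses.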

Figure  10  shows  a  sample  phoneme  string  before  and 
after  compression,  along  with  the  algorithm's  best  guess. 
In  this  example,  the  word  spoken  was  "ONE"  and  was  spoken  by 
the  system's  trainer  (a  female).  The  computer  recognized 
the word correctly as "ONE", but then apparently went on to
identify ambient noise as "EIGHT". Figure 11 shows a similar
event; however, in this case the word was "AIR-TO-AIR,"
spoken by a male speaker. The minor inaccuracy of the
machine's recognition ("CLEAR CLEAR") in this case can be
attributed to the fact that the system had not been
retrained to the male speaker's voice.

Another  male  speaker  trained  the  system  and  entered  the 
word  "AIR-TO-AIR".  Both  his  phoneme  string  before  and  after 
compression  are  shown  in  Figure  12,  along  with  the  computer 
algorithm's  best  guess. 

Similar  tests  were  performed  for  continuous  speech 
recognition.  The  female  speaker  who  originally  trained  the 
system  spoke  the  phrase,  "ONE  TWO  THREE."  Then,  using  her 
training  templates,  the  computer  algorithm  recognized  "AFT 
AFT  ONE  AFT  FUEL  AFT  THREAT  AFT  AFT".  Analyzing  this  string 
of  words,  one  could  expect  the  word  "AFT"  to  represent 
ambient  noise.  The  confusability  of  "FUEL"  for  "TWO"  and 
"THREAT"  for  "THREE"  means  that  not  all  the  sounds  present 
in each word are identified or matched. Those sounds with
higher  energy  are  detected,  while  softer  sounds  or  lower 
energy  phonemes  may  be  deleted  as  a  result  of  the  energy 
threshold. 

In summary, several trials were made using the system
trained by both a male and a female speaker. The machine
obviously has problems dealing with ambient noise, and
usually associates its best guess phoneme string with "AFT"
or "EIGHT". Also, stops and pauses between syllables such as
in "WAYPOINT" appear as zeroes. If the pause duration was
too long, the system determined that two words were spoken
instead of one, and an attempt at best fit was made for two
words. The other trend of confusing similar sounding words
reveals that vowels are easily spotted; however, low energy
fricatives are difficult to detect. This could possibly be
remedied by experimenting with voltage thresholding, or the
system might be made to be self-adjusting. The limitation
of phoneme representations to 10 phonemes may be too
restrictive, and so may be introducing an excessive number
of errors by taxing the system's ability to properly detect
words which are not well suited for such a compression.


VI.


Conclusions


Although this system has not been thoroughly tested,
several conclusions can still be drawn from the data already
accumulated concerning this speech recognition system. In
addition, some specific recommendations can be given for
system improvement. This chapter summarizes these two
areas.

Summary 

The  speech  recognition  system  which  was  designed  and 
implemented  was  speaker  dependent  and  operated  near  real 
time  (after  training).  Several  techniques  were  incorporated 
to  characterize  phonemes  as  vectors  in  space.  Through  the 
use  of  distance  rules  it  was  possible  to  characterize  words 
by  a  phoneme  representation,  which  could  subsequently  be 
used  in  word  recognition.  This  approach  to  speech 
recognition  offers  several  possibilities  for  future 
investigation, such as varying the Minkowski distances and
applying clustering techniques.

The  system  which  was  designed  was  user  friendly, 
providing  a  system  of  recognition  which  needed  little  user 
interaction.  The  instructions  were  kept  to  a  minimum  and 
made easy to understand, thereby taking a lot of the
guesswork out of trying to understand the programmer's
jargon.

The modularity of the system also proved useful by
allowing easy modification through the use of comment
characters. Any stage of recognition can be changed by
simply altering a subroutine. Another advantage to the way
this system was designed is that the variables which are
passed  are  kept  to  a  minimum,  thus  reducing  the  number  of 
items  which  must  be  accounted  for. 

The  objective  of  designing  a  speech  recognition  system 
capable  of  operating  in  real  time  was  met  by  this  research 
effort.  The  resulting  system  uses  a  phoneme  set  unique  to  a 
particular  speaker  and  partial  template  matching  for 
continuous  word  recognition. 


Recommendations

Several recommendations can be made for improving
system performance. First, the method of energy
thresholding should be investigated to determine an optimum
range of threshold values, since the present method allows
some of the low energy vowels and most consonants to be
deleted. Second, word phoneme representations probably
should not be limited to a length of ten. Instead, natural
breaks should be retained. Third, other methods of word
boundary detection could take the arbitrariness out of the
present system. Specifically, the use of two adjacent zero
vectors as an arbitrary word-word boundary could be
changed to a more fuzzy determination. Finally, branching
techniques could be used to predict phoneme transitions.

The continuation of this research should not be
difficult. Investigations into energy thresholding should
be made so that the system might become self-adjusting.
Additionally, other methods (besides Hussain's choice of
Minkowski 4) of distance calculations should be studied to
determine which will yield the best results. Also, the
compression technique for extracted phonemes could be
investigated to determine an optimum threshold for adjacent
phoneme distance and to see if deleting the highest-numbered
vector is really necessary. Montgomery (8) applied
techniques involving average branching factors for phoneme
transitions which could easily be incorporated into this
system. Though ambient noise was not removed in the manner
originally described, the effect of background noise on
energy thresholded vectors is another subject worthy of
further investigation. The choice of vocabulary for this
system may not have been the best suited for it. Perhaps
the phonetic alphabet (ALPHA, BRAVO, CHARLIE,...) would have
more validity as these characters are easily
distinguishable, having been designed for low confusability
in the presence of high levels of background noise.
Finally, the entire system would probably work more quickly
if implemented on an array processor, since most of the
calculations incorporate vector combinatorics.

This system was intended to present a model for a user
friendly continuous speech recognition system. Though
by no means a perfect system, it should provide a
foundation for future studies.

APPENDIX C

COMPUTER PROGRAM LISTINGS


Appendix C: Computer Program Listings

The source code for all computer programs and support
routines, written in FORTRAN V, is included in this
Appendix. Two macrofiles, ATODMC and TRAINMC, are used to
load all routines necessary for speech recognition. These
programs appear in the following sequence:


PROGRAM

ATODMC.MC
ATOD2.FR
ATOD5.FR
ATOD10.FR
DIS5.FR
PHONS.FR
REPS.FR
TRAINMC.MC
CMPRS.FR
CMPRS1.FR
COMPARE.FR
DISTANCE.FR
EXTRACT.FR
FINDWORD.FR
LOWENERGY.FR
NEWSCR.FR
NORMALIZE.FR
PRINTDIS.FR
PRINTREP.FR
RECOG.FR
REDBUF.FR
REDDIS.FR
REDPHON.FR
REDREP.FR
REDWRDS.FR
REP.FR
SPEECH.FR
TEMPLATE.FR
TRAIN.FR
VTYPE.FR
WRTDIS.FR
WRTPHON.FR
WRTREP.FR
VOCAB.FR


****** ATODMC.MC ********************************************

Function: load ATOD programs required by TRAIN.


RLDR/P 2/K ATOD10 SAMCONFIG3 @SAMLIB@
RLDR/P 2/K ATOD5 SAMCONFIG3 @SAMLIB@
RLDR/P 2/K ATOD2 SAMCONFIG3 @SAMLIB@


C***********************************************************
C
C     Title:    ATOD2.FR
C     Author:   1Lt Kathy R. Dixon
C     Date:     Nov 84
C
C     Function: performs A/D on Eclipse for 2 sec
C
C     Command Line:
C
C        RLDR/P 2/K ATOD2 SAMCONFIG3 @SAMLIB@
C
C***********************************************************

      EXTERNAL IDS21                  ;external input device
      EXTERNAL IDS23                  ;external output device
      COMMON/IBUFF/IDATA3(16384)      ;input data buffer
      COMMON/IBUFO/IWAST              ;output data buffer
      INTEGER IORBA(16),DEVICE

      DEVICE=21                       ;input device
      IDATA1=61700K                   ;external clock
      IDATA2=1600                     ;conversion count

      TYPE "<CR>
start<BEL><CR>"

      CALL DSTRT(IER)                 ;initialize A/D device
      IF(IER.NE.1)CALL ERROR("DSTRT ERROR")
      CALL DOITW(IORBA,IDS21,8,IDATA1,IDATA2,IDATA3,IER)
      IF(IER.NE.1)TYPE "DOIT ERROR",IER
      IF(IORBA(14).NE.40000K)TYPE "IORBA(14) RETURN",
     *   IORBA(14)

      TYPE "<CR>
stop<BEL><CR>"

      CALL DFILW("OUT2",IER)
      IF((IER.NE.1).AND.(IER.NE.13))TYPE "DFILW ERROR",IER
      CALL CFILW("OUT2",2,IER)
      IF(IER.NE.1)TYPE "CFILW ERROR",IER
      CALL OPEN(1,"OUT2",2,IER)
      IF(IER.NE.1)TYPE "OPEN FILE ERROR",IER
      CALL WRBLK(1,0,IDATA3,8,IER)
      IF(IER.NE.1)TYPE "WRBLK ERROR",IER


C***********************************************************
C
C     Title:    ATOD5.FR
C     Author:   1Lt Kathy R. Dixon
C     Date:     Nov 84
C
C     Function: performs A/D on Eclipse for 5 sec
C
C     Command Line:
C
C        RLDR/P 2/K ATOD5 SAMCONFIG2 @SAMLIB@
C
C***********************************************************

      EXTERNAL IDS21                  ;external input device
      EXTERNAL IDS23                  ;external output device
      COMMON/IBUFF/IDATA3(16384)      ;input data buffer
      COMMON/IBUFO/IWAST              ;output data buffer
      INTEGER IORBA(16),DEVICE

      DEVICE=21                       ;input device
      IDATA1=61700K                   ;external clock
      IDATA2=4000                     ;conversion count

      TYPE "<CR>
start<BEL><CR>"

      CALL DSTRT(IER)                 ;initialize A/D device
      IF(IER.NE.1)CALL ERROR("DSTRT ERROR")
      CALL DOITW(IORBA,IDS21,8,IDATA1,IDATA2,IDATA3,IER)
      IF(IER.NE.1)TYPE "DOIT ERROR",IER
      IF(IORBA(14).NE.40000K)TYPE "IORBA(14) RETURN",
     *   IORBA(14)

      TYPE "<CR>
stop<BEL><CR>"

      CALL DFILW("OUT5",IER)
      IF((IER.NE.1).AND.(IER.NE.13))TYPE "DFILW ERROR",IER
      CALL CFILW("OUT5",2,IER)
      IF(IER.NE.1)TYPE "CFILW ERROR",IER
      CALL OPEN(1,"OUT5",2,IER)
      IF(IER.NE.1)TYPE "OPEN FILE ERROR",IER
      CALL WRBLK(1,0,IDATA3,16,IER)
      IF(IER.NE.1)TYPE "WRBLK ERROR",IER


C***********************************************************
C
C     Title:    ATOD10.FR
C     Author:   1Lt Kathy R. Dixon
C     Date:     Nov 84
C
C     Function: performs A/D on Eclipse for 10 sec.
C
C     Command Line:
C
C        RLDR/P 2/K ATOD10 SAMCONFIG3 @SAMLIB@
C
C***********************************************************

      EXTERNAL IDS21                  ;external input device
      EXTERNAL IDS23                  ;external output device
      COMMON/IBUFF/IDATA3(16384)      ;input data buffer
      COMMON/IBUFO/IWAST              ;output data buffer
      INTEGER IORBA(16),DEVICE

      DEVICE=21                       ;input device
      IDATA1=61700K                   ;external clock
      IDATA2=8000                     ;conversion count

      TYPE "<CR>
start<BEL><CR>"

      CALL DSTRT(IER)                 ;initialize A/D device
      IF(IER.NE.1)CALL ERROR("DSTRT ERROR")
      CALL DOITW(IORBA,IDS21,8,IDATA1,IDATA2,IDATA3,IER)
      IF(IER.NE.1)TYPE "DOIT ERROR",IER
      IF(IORBA(14).NE.40000K)TYPE "IORBA(14) RETURN",
     *   IORBA(14)

      TYPE "<CR>
stop<BEL><CR>"

      CALL DFILW("OUT10",IER)
      IF(IER.NE.1.AND.IER.NE.13)TYPE "DFILW ERROR",IER
      CALL CFILW("OUT10",2,IER)
      IF(IER.NE.1)TYPE "CFILW ERROR",IER
      CALL OPEN(1,"OUT10",2,IER)
      IF(IER.NE.1)TYPE "OPEN FILE ERROR",IER
      CALL WRBLK(1,0,IDATA3,32,IER)
      IF(IER.NE.1)TYPE "WRBLK ERROR",IER
      CALL CLOSE(1,IER)
      IF(IER.NE.1)TYPE "CLOSE ERROR",IER
      CALL EXIT
      END


C     Title:    DIS5.FR
C     Author:   1Lt Kathy Dixon
C     Date:     Nov 84
C
C     Function: Prints distance matrix from a disk file
C
C***********************************************************

      REAL DIS(2432)
      CALL OPEN(1,"DIST",2,IER)
      IF(IER.NE.1)TYPE "OPEN ERROR",IER
      READ(1,100)(DIS(I),I=1,2432)
      WRITE(12,101)(DIS(I),I=1,2432)
100   FORMAT(G11.5)
101   FORMAT(8G11.5)
      CALL CLOSE(1,IER)
      IF(IER.NE.1)TYPE "CLOSE ERROR",IER
      RETURN
      END


C***********************************************************
C
C     Title:    PHONS.FR
C     Author:   1Lt Kathy Dixon
C     Date:     Nov 84
C
C     Function: Reads up to 70 16-dimensional vectors
C               from a file, PHONE, and prints them on
C               the line printer.
C
C***********************************************************

      INTEGER PHON(1130)
      CALL OPEN(1,"PHONE",2,IER)
      IF(IER.NE.1)TYPE "OPEN ERROR",IER
      READ(1,100)(PHON(I),I=1,1130)
      WRITE(12,101)(PHON(I),I=1,1130)
100   FORMAT(I6)
101   FORMAT(2X,16I6)
      CALL CLOSE(1,IER)
      IF(IER.NE.1)TYPE "CLOSE ERROR",IER
      RETURN
      END


C***********************************************************
C
C     Title:    REPS.FR
C     Author:   1Lt Kathy Dixon
C     Date:     Nov 84
C
C     Function: Reads word phonemes from a file
C               VOCABUL.
C
C***********************************************************

      INTEGER VOCAB(10,70),NOWORDS
      NOWORDS=70
      CALL OPEN(1,"VOCABUL",2,IER)
      IF(IER.NE.1)TYPE "OPEN ERROR",IER
      READ(1,100)((VOCAB(I,J),I=1,10),J=1,NOWORDS)
      WRITE(12,101)((VOCAB(I,J),I=1,10),J=1,NOWORDS)
100   FORMAT(I3)
101   FORMAT(10I3)
      CALL CLOSE(1,IER)
      IF(IER.NE.1)TYPE "CLOSE ERROR",IER
      RETURN
      END

******* TRAINMC.MC *********************************************
Function: Loads TRAIN and required subroutines
****************************************************************


RLDR TRAIN NEWSCR TEMPLATE DISTANCE PRINTDIS REDWRDS REP^
PRINTREP SPEECH REDBUF NORMALIZE LOWENERGY CMPRS1^
COMPARE EXTRACT CMPRS RECOG FINDWORD VTYPE^
REDDIS REDPHON REDREP WRTDIS WRTREP WRTPHON @FLIB@


      NOS=(1+ISWITCH)*125           ;no. of vectors for processing
      LDIS=10.0                     ;set distance threshold
      DO 76 I=1,250
76    SOUND(I)=0
      WRITE(12,35)(LIB(I),I=1,NOS)  ;option to write
35    FORMAT(25I3)                  ;phoneme string to line printer
      DO 806 K=1,5
      DO 809 I=1,((IDATA2/16)-1)
      IF(LIB(I).EQ.0)GO TO 807
      IF(LIB(I+1).EQ.0)GO TO 807
      IF(LIB(I).EQ.LIB(I+1))GO TO 807
      N=LIB(I)
      P=LIB(I+1)
      IF(N.GT.P)GO TO 810
      Q=N
      N=P
      P=Q
810   IF(DIS(((N*(N-1))/2)+P).GE.LDIS)GO TO 807
      IF(LIB(I).LT.LIB(I+1))LIB(I+1)=LIB(I)
      IF(LIB(I).GT.LIB(I+1))LIB(I)=LIB(I+1)
807   CONTINUE
809   CONTINUE
806   CONTINUE
      J=1
      DO 805 I=1,(IDATA2/16)
      IF((LIB(I).EQ.0).AND.(J.EQ.1))GO TO 805
      IF((LIB(I).EQ.0).AND.(LIB(I-1).NE.0))GO TO 805


C***********************************************************
C
C  Title:  COMPARE.FR
C  Author: 1Lt Kathy Dixon
C          Nearest phoneme code derived from
C          CAPT Ajmal Hussain
C  Date:   Nov 84
C
C  Function:
C     Compares normalized data vectors.  Finds each
C     vector's nearest vector.  Averages two vectors,
C     replaces lowest numbered vector with new vector,
C     sets vector components of second vector to 32000
C
C***********************************************************

SUBROUTINE COMPARE(IDATA2)

COMMON/IBUFF/IDATA(8192)
INTEGER IDATA5(500),NOVECT,IDATA2,TEMP4,IDATA6
REAL DIFF(500)
DOUBLE PRECISION REAL TEMP,TEMP1

TEMP=0
TEMP1=9.0E60
IDATA6=IDATA(IDATA2+1)*16

DO 104 J=1,IDATA6,16
DO 102 K=1,IDATA6,16
IF(J.EQ.K) GO TO 103
DO 101 L=0,15
101  TEMP=TEMP+(FLOAT(IDATA(J+L))-FLOAT(IDATA(K+L)))**4
IF(TEMP.GE.TEMP1) GO TO 103
TEMP1=TEMP
IDATA5(INT((J+15)/16))=INT((K+15)/16)
DIFF(INT((J+15)/16))=TEMP
103  TEMP=0
102  CONTINUE
TEMP1=9.0E60
104  CONTINUE

TEMP=0
KKK=0
NOVECT=IDATA(IDATA2+1)

DO 107 I=1,NOVECT
107  IF(DIFF(I).GT.TEMP)TEMP=DIFF(I)

DO 108 I=1,NOVECT
108  DIFF(I)=(DIFF(I)/TEMP)

DO 111 I=1,NOVECT
J=IDATA5(I)
TEMP=DIFF(I)
DO 2 JJ=1,NOVECT
K=0
IF(IDATA5(JJ).EQ.I)TEMP1=DIFF(JJ)
IF((TEMP1.EQ.0).OR.(TEMP1.EQ.100))GO TO 2
IF(TEMP1.LT.TEMP)K=JJ
IF(J.EQ.K)GO TO 10
2  CONTINUE
GO TO 111

10  DO 15 KL=1,16
IDATA(((I-1)*16)+KL)=(IDATA(((I-1)*16)+KL)
*  +IDATA(((J-1)*16)+KL))/2
15  IDATA(((J-1)*16)+KL)=32000
DIFF(J)=100
DO 20 JJJ=1,NOVECT
IF((IDATA5(JJJ).EQ.I).AND.(DIFF(JJJ).NE.100))
*  DIFF(JJJ)=0
20  CONTINUE
111  CONTINUE

DO 222 I=1,NOVECT
222  IF(DIFF(I).EQ.100)KKK=KKK+1
IDATA(IDATA2+1)=IDATA(IDATA2+1)-KKK

RETURN
END
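The idea behind COMPARE (find each vector's nearest neighbour, average a close pair, retire the second vector) can be illustrated with a much simplified Python sketch. It is not a translation of the listing: it merges only the single closest pair per call and drops the retired vector immediately, whereas the listing marks it with 32000 and leaves removal to a later step. The function names and the `limit` parameter are hypothetical; the value 70 mirrors the `NOVECT.LT.70` test in TEMPLATE.

```python
def m4(a, b):
    """Sum of fourth powers of component differences (no root taken)."""
    return sum((float(x) - float(y)) ** 4 for x, y in zip(a, b))


def nearest_neighbors(vectors):
    """For every vector return (index of nearest other vector, distance)."""
    result = []
    for j, vj in enumerate(vectors):
        best, best_d = None, float("inf")
        for k, vk in enumerate(vectors):
            if j == k:
                continue
            d = m4(vj, vk)
            if d < best_d:                 # strict "<": first of equals wins
                best, best_d = k, d
        result.append((best, best_d))
    return result


def merge_closest_pair(vectors):
    """Average the closest pair into the lower-numbered slot, drop the other."""
    nn = nearest_neighbors(vectors)
    j = min(range(len(vectors)), key=lambda i: nn[i][1])
    k = nn[j][0]
    lo, hi = min(j, k), max(j, k)
    # int(...) truncates toward zero, like FORTRAN integer division
    merged = [int((a + b) / 2) for a, b in zip(vectors[lo], vectors[hi])]
    return [merged if i == lo else v
            for i, v in enumerate(vectors) if i != hi]


def reduce_to(vectors, limit=70):
    """Keep merging until fewer than `limit` vectors remain."""
    while len(vectors) >= limit and len(vectors) > 1:
        vectors = merge_closest_pair(vectors)
    return vectors
```

The reduction loop corresponds to the repeated `CALL COMPARE` in TEMPLATE, which continues until fewer than 70 vectors are left.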


C***********************************************************
C
C  Title:  DISTANCE.FR
C  Author: Capt. Ajmal Hussain
C          Modified by 1Lt Kathy Dixon
C  Date:   Nov 84
C
C  Function:
C     Finds Minkowski 4 distance between phonemes in
C     phoneme template.
C
C***********************************************************

SUBROUTINE DISTANCE(PHON)

REAL DIS(2432)
INTEGER PHON(1130)
DOUBLE PRECISION REAL TEMP,TEMP1

DO 10 I=1,2432                     ;zero distance matrix
10  DIS(I)=0.

TEMP=0
TEMP1=0
I=1

DO 31 J=1,(PHON(1121)*16),16
DO 32 K=1,(PHON(1121)*16),16
IF(K.GT.J) GO TO 35
DO 33 L=0,15
33  TEMP=TEMP+(FLOAT(PHON(J+L))-FLOAT(PHON(K+L)))**4
                                   ;M-4 calculation
IF(TEMP.GT.TEMP1)TEMP1=TEMP        ;find largest distance
DIS(I)=TEMP                        ;store distance
I=I+1                              ;increment DIS
35  TEMP=0
32  CONTINUE
31  CONTINUE

DO 34 I=1,(((PHON(1121)*(PHON(1121)-1))/2)
*  +PHON(1121))
34  DIS(I)=((DIS(I)/TEMP1)**0.25)*100   ;normalize DIS to TEMP1

DIS(2416)=PHON(1121)               ;store no. phonemes

CALL WRTDIS(DIS)

RETURN
END
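DISTANCE stores only the lower triangle of the phoneme distance matrix (diagonal included) in a one-dimensional array, and other routines look entries up with the index `N*(N-1)/2 + P` for `N >= P`. The Python sketch below illustrates that packing and the scaling to a 0 to 100 range; it is not the FORTRAN 5 of the listing. The function names are hypothetical, and the guard for an all-zero table is an addition that the listing does not have.

```python
def minkowski4_raw(a, b):
    """Sum of fourth powers of component differences (root taken later)."""
    return sum((float(x) - float(y)) ** 4 for x, y in zip(a, b))


def build_distance_table(templates):
    """Return packed lower-triangular distances scaled to 0..100.

    Row j holds the distances from template j to templates 1..j, so the
    rows have lengths 1, 2, 3, ... and are stored one after another.
    """
    raw = []
    for j, tj in enumerate(templates):
        for k in range(j + 1):               # columns up to and including j
            raw.append(minkowski4_raw(tj, templates[k]))
    largest = max(raw)
    if largest == 0:                         # guard not present in the listing
        return [0.0] * len(raw)
    # fourth root of the ratio, expressed as a percentage of the largest
    return [((d / largest) ** 0.25) * 100.0 for d in raw]


def lookup(table, n, p):
    """Distance between 1-based phoneme numbers n and p."""
    if n < p:                                # the table only stores n >= p
        n, p = p, n
    return table[(n * (n - 1)) // 2 + p - 1]  # -1: Python lists are 0-based
```

For three templates the table has 3*4/2 = 6 entries. With at most 69 phonemes the packed table needs 69*70/2 = 2415 entries, which is consistent with the listing using `DIS(2416)` to hold the phoneme count.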




C***********************************************************
C
C  Title:  EXTRACT.FR
C  Author: Capt. Ajmal Hussain
C          Modified by 1Lt Kathy Dixon
C  Date:   Nov 84
C
C  Function:
C     Extracts phonemes from input speech file.
C
C***********************************************************


SUBROUTINE EXTRACT(IDATA2,PHON,IENERGY,ENRGY,LIB)

COMMON/IBUFF/IDATA(4096)
INTEGER LIB(250),PHON(1130)
REAL IENERGY(500),ENRGY
DOUBLE PRECISION REAL TEMP,TEMP1

DO 1 I=1,250                       ;zero array to hold
1  LIB(I)=0                        ;extracted phonemes

TEMP=0
I=1
M=0
TEMP1=9.0E60

DO 87 L=1,IDATA2,16
IF(IENERGY(I).LT.ENRGY) GO TO 801  ;check energy threshold
DO 86 K=1,(PHON(1121)*16),16
DO 85 J=L,(L+15)
TEMP=TEMP+(FLOAT(IDATA(J)-PHON(K+M)))**4   ;M-4 distance
85  M=M+1                          ;between template
IF(TEMP.GE.TEMP1)GO TO 82          ;vector
TEMP1=TEMP
LIB(I)=(K+15)/16
82  TEMP=0
M=0
86  CONTINUE
801  TEMP1=9.0E60
87  I=I+1
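The labelling performed by EXTRACT, where each 16-sample input vector receives the number of its nearest template phoneme and low-energy vectors stay at zero, can be sketched in Python as follows. This is an illustration rather than the listing's FORTRAN 5; the function and parameter names are hypothetical.

```python
def label_frames(frames, energies, threshold, templates):
    """Assign each frame the 1-based number of its nearest template.

    frames    : list of equal-length vectors of input speech
    energies  : one energy value per frame
    threshold : frames below this energy are labelled 0 (skipped)
    templates : list of phoneme template vectors
    """
    labels = []
    for frame, energy in zip(frames, energies):
        if energy < threshold:
            labels.append(0)               # 0 marks a low-energy frame
            continue
        best, best_dist = 0, float("inf")
        for number, template in enumerate(templates, start=1):
            # Minkowski-4 distance without the final root; the root does
            # not change which template is nearest
            dist = sum((float(x) - float(t)) ** 4
                       for x, t in zip(frame, template))
            if dist < best_dist:           # strict "<": first of equals wins
                best, best_dist = number, dist
        labels.append(best)
    return labels
```

The strict comparison matches the listing's `IF(TEMP.GE.TEMP1)GO TO 82`, which keeps the earlier template when two are equally close.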


C***********************************************************
C
C  Title:  FINDWORD.FR
C  Author: Capt. Ajmal Hussain
C          Modified by 1Lt Kathy Dixon
C  Date:   Nov 84
C
C  Function:
C     This routine compares a phoneme string with word
C     strings in a library based upon a distance matrix
C     to give the word in the library which is the best
C     match.
C
C***********************************************************

SUBROUTINE FINDWORD(PHON,IPHON,VOCAB,MAT,TEMP3,L)

INTEGER I,J,K,L,M,IPHON(250),LIB(700),NOWORD
INTEGER VOCAB(10,70),PHON(1130)
DOUBLE PRECISION REAL TEMP1,TEMP,TEMP3
REAL MAT(2432),PEN

70  TEMP3=9.0E60                   ;initialize variables
TEMP=0
COUNT=0
NOWORD=70
J=1
L=0

DO 1 I=1,70
DO 2 K=1,10
LIB(J)=VOCAB(K,I)
2  J=J+1
1  CONTINUE

C  start comparison

DO 71 M=1,(NOWORD*10),10           ;library, each word a
                                   ;maximum of 10 phonemes

DO 72 K=-1,1                       ;shift phoneme string one phoneme
                                   ;left, none and one phoneme right to
                                   ;account for error in first phoneme
                                   ;string

DO 73 I=1,10                       ;compare phoneme at a time for each
                                   ;word in library

IF((I+K).EQ.0) GO TO 73            ;skip first phoneme when
                                   ;string shifted left one
                                   ;phoneme

C  if both phonemes zero error value unchanged

IF((LIB(M+I-1).EQ.0).AND.(IPHON(I+K).EQ.0)) GO TO 71

C  if both phonemes not zero add distance between
C  phonemes to error value

IF((LIB(M+I-1).NE.0).AND.(IPHON(I+K).NE.0)) GO TO 74

C  if one phoneme zero only add penalty to error value

P=LIB(M+I-1)
PEN=0
DO 135 JJ=0,15
135  PEN=PEN+(FLOAT(PHON((P*16)-JJ))**4)
PEN=((PEN)**0.25)
TEMP1=TEMP+PEN
GO TO 75

74  N=IPHON(I+K)
P=LIB(M+I-1)
IF(N.GE.P) GO TO 76
Q=N
N=P
P=Q
76  TEMP1=MAT(((N*(N-1))/2)+P)     ;find distance between
                                   ;phonemes from
                                   ;distance matrix
TEMP=TEMP+TEMP1                    ;add distance to
COUNT=COUNT+1                      ;average error value
73  CONTINUE

TEMP=TEMP/COUNT                    ;and find word
IF(TEMP.GT.TEMP3) GO TO 77         ;match with minimum
TEMP3=TEMP                         ;error value
L=M

77  TEMP=0                         ;initialize variables
                                   ;for next word
COUNT=0
72  CONTINUE
71  CONTINUE

RETURN
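The matching idea in FINDWORD can be illustrated with a Python sketch: every library word is compared with the heard phoneme string at shifts of -1, 0 and +1, accumulating a table distance where both strings have a phoneme and a penalty where only one does, and keeping the word with the smallest average error. This is a cleaned-up illustration under stated assumptions, not a translation of the listing, whose branch targets are partly illegible. `dist` and `penalty` are hypothetical callables standing in for the distance-matrix lookup and the template-norm penalty. Unlike the listing, ties here keep the earlier word.

```python
def word_error(word, heard, shift, dist, penalty):
    """Average per-position error between a stored word and a heard string.

    word, heard : lists of phoneme numbers, 0 meaning "no phoneme"
    shift       : offset applied to the heard string (-1, 0 or +1)
    """
    total, count = 0.0, 0
    for i in range(len(word)):
        j = i + shift
        w = word[i]
        h = heard[j] if 0 <= j < len(heard) else 0
        if w == 0 and h == 0:
            continue                        # both empty: error unchanged
        if w != 0 and h != 0:
            total += dist(w, h)             # both present: table distance
        else:
            total += penalty(w if w != 0 else h)   # one side empty
        count += 1
    return total / count if count else float("inf")


def best_word(vocab, heard, dist, penalty):
    """Return (0-based index of best word, its average error)."""
    best_index, best_err = None, float("inf")
    for index, word in enumerate(vocab):
        for shift in (-1, 0, 1):
            err = word_error(word, heard, shift, dist, penalty)
            if err < best_err:
                best_index, best_err = index, err
    return best_index, best_err
```

Trying three shifts lets a word still match when the heard string has one spurious or one missing phoneme at its start.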


C***********************************************************
C
C  Title:  LOWENERGY.FR
C  Author: Capt. Ajmal Hussain
C          Modified by Kathy Dixon
C  Date:   Nov 84
C
C  Function:
C     Checks energy normalized vectors for energy less
C     than a particular threshold.  Sets vector weights
C     to 0.
C
C***********************************************************


SUBROUTINE LOWENERGY(IDATA2,IENERGY,ENRGY,IWG)

INTEGER IDATA2,IWG(500)
REAL IENERGY(500),ENRGY
COMMON/IBUFF/IDATA(8192)

DO 1 I=1,500
1  IWG(I)=1

DO 2 L=1,(IDATA2/16)
IF(IENERGY(L).GE.ENRGY) GO TO 2    ;check phoneme energy
IWG(L)=0
K=K+1
2  CONTINUE

IDATA(IDATA2+1)=(IDATA2/16)-K      ;store no. of
                                   ;remaining vectors

RETURN


C***********************************************************
C
C  Title:  REP.FR
C  Author: 1Lt Kathy R. Dixon
C  Date:   Nov 84
C
C  Function:
C     Creates phoneme representation of a word.
C     Stores representation in array, SOUND.
C
C***********************************************************

SUBROUTINE REP(ENRGY,PHON,DIS,SOUND)

COMMON/IBUFF/IDATA(4096)
INTEGER PHON(1130),SOUND(250),LIB(250),IDATA2
INTEGER ISWITCH,ISTOP,IFILE
REAL DIS(2432),IENERGY(500),ENRGY

IDATA2=1600                        ;conversion count
ISTOP=8                            ;last block to read
IFILE=2                            ;file to read
ISWITCH=0                          ;switch to reduce input
                                   ;phoneme string to max 10

CALL REDBUF(IFILE,ISTOP)           ;read data from file
CALL NORMALIZE(IDATA2,IENERGY)     ;normalize vectors
CALL EXTRACT(IDATA2,PHON,IENERGY,ENRGY,LIB)   ;extract phonemes
CALL CMPRS(IDATA2,PHON,DIS,LIB,ISWITCH,SOUND,J)
                                   ;compress phoneme string

RETURN
END


C***********************************************************
C
C  Title:  REDWRDS.FR
C  Author: 1Lt Kathy R. Dixon
C  Date:   Nov 84
C
C  Function:
C     Reads vocabulary words from a file, WORDS.
C
C***********************************************************

SUBROUTINE REDWRDS(NOWORDS,WORD)

INTEGER WORD(7,70),NOWORDS

CALL OPEN(1,"WORDS",2,IER)
IF(IER.NE.1)TYPE"OPEN ERROR",IER
READ(1,100)(WORD(1,I),I=1,NOWORDS)
100  FORMAT(S14)

CALL CLOSE(1,IER)
IF(IER.NE.1)TYPE"CLOSE ERROR",IER

RETURN


C***********************************************************
C
C  Title:  REDPHON.FR
C  Author: Kathy Dixon
C  Date:   Nov 84
C
C  Function:  Reads 70 16-dimensional vectors
C             from a file PHONE.
C
C***********************************************************

SUBROUTINE REDPHON(PHON)

INTEGER PHON(1130)

CALL OPEN(1,"PHONE",2,IER)
IF(IER.NE.1)TYPE"OPEN ERROR",IER
READ(1,100)(PHON(I),I=1,1130)
100  FORMAT(I6)

CALL CLOSE(1,IER)
IF(IER.NE.1)TYPE"CLOSE ERROR",IER

RETURN
END




C***********************************************************
C
C  Title:  REDDIS.FR
C  Author: 1Lt Kathy Dixon
C  Date:   Nov 84
C
C  Function:  Reads distance matrix from a disk file.
C
C***********************************************************

SUBROUTINE REDDIS(DIS)

REAL DIS(2432)

CALL OPEN(1,"DIST",2,IER)
IF(IER.NE.1)TYPE"OPEN ERROR",IER
READ(1,100)(DIS(I),I=1,2432)
100  FORMAT(G11.5)

CALL CLOSE(1,IER)
IF(IER.NE.1)TYPE"CLOSE ERROR",IER

RETURN
END


S=1
GO TO 172

C  CONTINUOUS RECOGNITION

177  V=1
W=1
X=1
Y=1
FLAG=0

171  DO 176 I=1,10
176  TWORD(I)=0
TEMP3=9.0E60

DO 173 I=1,10
TWORD(I)=WRD(V)
V=V+1
IF(I.LE.2)GO TO 173
CALL FINDWORD(PHON,TWORD,VOCAB,DIS,TEMP,L)
IF(Y.LE.1)GO TO 180
IF((L.EQ.REJ((Y-1),(Y-1))).AND.(FLAG.EQ.1))GO TO 186
180  IF(TEMP.GE.TEMP3)GO TO 173
TEMP3=TEMP
T=L
U=I
GO TO 173
186  FLAG=0
173  CONTINUE

REJ(Y,X)=T
TOT(Y)=TOT(Y)+TEMP3
X=X+1
IF((W+U).LT.S)GO TO 174
IF(Y.LE.4)GO TO 179
TOT(Y)=TOT(Y)/X
S=1

TEMP3=9.0E60
DO 181 I=1,Y
IF(TOT(I).GE.TEMP3)GO TO 181
TEMP3=TOT(I)
Z=I
181  CONTINUE


C***********************************************************
C
C  Title:  RECOG.FR
C  Author: Capt Ajmal Hussain
C          Modified by 1Lt Kathy Dixon
C  Date:   Nov 84
C
C  Function:
C     Performs continuous speech recognition.  Prints
C     recognized string on video screen and line printer
C
C***********************************************************


SUBROUTINE RECOG(PHON,VOCAB,WORD,DIS,LIB,J)

INTEGER LIB(250),VOCAB(10,70),J,K,I,L,M,N,P,Q,R,S,T
INTEGER U,V,W,X,Y,Z,TWORD(10),WRD(250),REJ(10,10)
INTEGER FLAG,WORD(7,70),PHON(1130)
REAL DIS(2432),TOT(5)
DOUBLE PRECISION REAL TEMP,TEMP1,TEMP3

LEN1=2
LEN2=9

DO 188 I=1,70
DO 188 K=1,10
188  REJ(K,I)=0

DO 187 I=1,5
187  TOT(I)=0

R=1
S=1

DO 175 I=1,250
175  WRD(I)=0

79  IF(R.GE.(J+1))GO TO 1000
IF(LIB(R).EQ.0)GO TO 170
WRD(S)=LIB(R)
R=R+1
S=S+1
GO TO 79

170  IF(S.GT.LEN1)GO TO 178
R=R+1
GO TO 79

178  IF(S.GT.LEN2)GO TO 177
CALL FINDWORD(PHON,WRD,VOCAB,DIS,TEMP,L)
CALL VTYPE(L,WORD)




C***********************************************************
C
C  Title:  PRINTREP.FR
C  Author: 1Lt Kathy R. Dixon
C  Date:   Nov 84
C
C  Function:
C     Prints table of vocabulary words and associated
C     phoneme representation
C
C***********************************************************

SUBROUTINE PRINTREP(WORD,VOCAB,NOWORDS)

INTEGER WORD(7,70),VOCAB(10,70),NOWORDS

WRITE(12,100)(WORD(1,I),(VOCAB(J,I),J=1,10),
*  I=1,NOWORDS)
100  FORMAT(2X,S14,10X,10I4)

RETURN
END


K=820

DO 502 I=41,DIS(2416)
WRITE(12,58)
WRITE(12,54)I
DO 503 J=1,40
WRITE(12,54)(INT((DIS(((I*(I-1))/2)+J))))
503  CONTINUE
502  CONTINUE

WRITE(12,59)
WRITE(12,51)IOP
WRITE(12,58)
WRITE(12,58)
WRITE(12,55)

DO 505 I=41,DIS(2416)
WRITE(12,54)I
505  CONTINUE

DO 506 I=41,DIS(2416)
WRITE(12,58)
WRITE(12,54)I
DO 507 J=41,DIS(2416)
IF(J.GT.I)GO TO 507
WRITE(12,54)(INT((DIS(((I*(I-1))/2)+J))))
507  CONTINUE
506  CONTINUE

500  RETURN
END




SUBROUTINE PRINTDIS(DIS)

REAL DIS(2432)

IOP=1
L=DIS(2416)

WRITE(12,51)IOP
51  FORMAT(50X,"MAT",I2)
WRITE(12,58)

IF(DIS(2416).LE.40)GO TO 52
L=40

52  WRITE(12,55)
55  FORMAT("+",3X,Z)

DO 53 I=1,L
WRITE(12,54)I
53  CONTINUE
54  FORMAT("+",I3,Z)

K=1
DO 56 I=1,L
WRITE(12,58)
WRITE(12,54)I
DO 57 J=1,L
IF(J.GT.I)GO TO 57
WRITE(12,54)(INT(DIS(K)))
K=K+1
57  CONTINUE
56  CONTINUE
58  FORMAT(1X)

IF(DIS(2416).LE.40)GO TO 500

WRITE(12,59)
59  FORMAT("1")
WRITE(12,51)IOP
WRITE(12,58)
WRITE(12,58)
WRITE(12,55)

DO 501 I=1,40
WRITE(12,54)I
501  CONTINUE




C***********************************************************
C
C  Title:  NORMALIZE.FR
C  Author: Capt. Ajmal Hussain
C          Modified by Kathy Dixon
C  Date:   Nov 84
C
C  Function:
C     Normalizes sixteen dimensional vectors
C     of digitized speech.
C
C***********************************************************

SUBROUTINE NORMALIZE(IDATA2,IENERGY)

COMMON/IBUFF/IDATA(8192)
INTEGER IDATA2
REAL IENERGY(500)
DOUBLE PRECISION REAL TEMP

TEMP=0
K=1
J=1
L=1

DO 5 I=1,(IDATA2/16)
TEMP=0

DO 2 J=K,(K+15)
TEMP=TEMP+FLOAT(IDATA(J))**2       ;energy calculation
2  CONTINUE

TEMP=(SQRT(TEMP)/32000)
IENERGY(I)=ABS(TEMP)               ;store energy values

DO 4 J=K,(K+15)
4  IDATA(J)=FLOAT(IDATA(J))/TEMP   ;energy normalize
5  K=K+16                          ;vectors

RETURN
END
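The energy normalization in NORMALIZE can be sketched in Python as follows (an illustration, not the FORTRAN 5 of the listing). Each block of 16 samples is treated as a vector; its energy is its Euclidean length divided by 32000, and every component is then divided by that energy, so all non-silent vectors end up with a length of about 32000. The function name and the zero-energy guard are additions; the listing would divide by zero on an all-zero vector.

```python
import math


def normalize_frames(samples, width=16, full_scale=32000):
    """Return (normalized samples, per-frame energies).

    samples    : flat list of integer speech samples
    width      : samples per vector (16 in the listing)
    full_scale : scaling constant (32000 in the listing)
    """
    energies, normalized = [], []
    for start in range(0, len(samples) - width + 1, width):
        frame = samples[start:start + width]
        # Euclidean length of the frame as a fraction of full scale
        energy = math.sqrt(sum(float(x) ** 2 for x in frame)) / full_scale
        energies.append(energy)
        if energy == 0:
            normalized.extend([0] * width)   # guard not present in the listing
        else:
            # int() truncates toward zero, like storing a REAL in an INTEGER
            normalized.extend(int(x / energy) for x in frame)
    return normalized, energies
```

Because the stored energies are kept separately, later stages can still reject quiet vectors by threshold even though all surviving vectors have been scaled to a common length.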




C***********************************************************
C
C  Title:  NewScr
C  Author: Lt Allen
C  Date:   Dec 82
C
C  Function:
C     This routine erases the screen by typing 24
C     blank lines.
C
C  Compile command:
C     FORTRAN NEWSCR
C
C***********************************************************

SUBROUTINE NEWSCR

DO 10 I=1,24
TYPE
10  CONTINUE

RETURN


C***********************************************************
C
C  Title:  SPEECH.FR
C  Author: 1Lt Kathy R. Dixon
C  Date:   Nov 84
C
C  Function:
C     Calls routines to recognize continuous speech.
C
C***********************************************************


SUBROUTINE SPEECH(ENRGY,PHON,DIS,VOCAB,WORD)

COMMON/IBUFF/IDATA(4096)
INTEGER PHON(1130),VOCAB(10,70),LIB(250),SOUND(250)
INTEGER ISTOP,IFILE,IDATA2,WORD(7,70),ISWITCH,J
REAL DIS(2432),IENERGY(500),ENRGY

IDATA2=4000                        ;conversion value
ISTOP=16                           ;last block to read
IFILE=5                            ;file to read
ISWITCH=1                          ;compress input string once

CALL REDBUF(IFILE,ISTOP)           ;read data
CALL NORMALIZE(IDATA2,IENERGY)     ;normalize vectors
CALL EXTRACT(IDATA2,PHON,IENERGY,ENRGY,LIB)
                                   ;extract phonemes
CALL CMPRS(IDATA2,PHON,DIS,LIB,ISWITCH,SOUND,J)
                                   ;compress phonemes
CALL RECOG(PHON,VOCAB,WORD,DIS,SOUND,J)
                                   ;determine words in
                                   ;input string
RETURN                             ;print to screen
END


C***********************************************************
C
C  Title:  TEMPLATE.FR
C  Author: 1Lt Kathy Dixon
C  Date:   Nov 84
C
C  Function:
C     Performs necessary calls to support routines
C     to produce a phoneme template from 500 vectors
C     of input speech.  The template is stored
C     sequentially in the array, PHON.
C
C***********************************************************

SUBROUTINE TEMPLATE(ENRGY)

INTEGER PHON(1130),NOVECT,IDATA2
INTEGER WEIGHT(500),IWGHT(500),IWGHTT(500)
REAL IENERGY(500),ENRGY
COMMON/IBUFF/IDATA(8192)

NOVECT=500
IDATA2=8000                        ;conversion count
ISTOP=32                           ;no. blocks to read
IFILE=10                           ;file to read
ENRGY=0

CALL REDBUF(IFILE,ISTOP)           ;read data into buffer
CALL NORMALIZE(IDATA2,IENERGY)     ;normalize vectors

C
WRITE(12,347)(IENERGY(I),I=1,500)  ;option to write
347  FORMAT(10G11.5)               ;energy values to line
                                   ;printer

DO 88 I=1,500
88  ENRGY=ENRGY+IENERGY(I)
ENRGY=(ENRGY/500)

CALL LOWENERGY(IDATA2,IENERGY,ENRGY,WEIGHT)
                                   ;low energy vectors

TYPE"Calculating template"

CALL CMPRS1(IDATA2,WEIGHT,IWGHTT)  ;delete vectors

DO 99 I=1,500
99  WEIGHT(I)=IWGHTT(I)

1  NOVECT=IDATA(IDATA2+1)
TYPE"Number of remaining vectors ",NOVECT
IF(NOVECT.LT.70)GO TO 2

CALL COMPARE(IDATA2,WEIGHT,IWGHT)  ;compare and average
                                   ;nearest vectors
DO 44 I=1,500
44  WEIGHT(I)=IWGHT(I)
GO TO 1

2  DO 5 I=1,1130                   ;zero phoneme array
5  PHON(I)=0

DO 3 I=1,1120                      ;store phoneme template
3  PHON(I)=IDATA(I)

PHON(1121)=NOVECT                  ;store no. of phonemes

CALL WRTPHON(PHON)

RETURN


C***********************************************************
C
C  Title:  TRAIN.FR
C  Author: 1Lt Kathy R. Dixon
C  Date:   Jul 84
C
C  Function:
C     This routine drives the system for training
C     and speech recognition for a specified user.
C
C***********************************************************


INTEGER PHON(1130),SOUND(250),VOCAB(10,70)
INTEGER WORD(7,70)
REAL DIS(2432),ENRGY

CALL NEWSCR                        ;erase screen

TYPE"<CR>
Welcome to the AFIT Speech Recognition Project.<CR>
<CR>
Using AFIT studies in speech research, a machine<CR>
was developed to recognize continuous speech.<CR>
Though training is required, the ultimate goal<CR>
of this work is to develop a machine which is<CR>
speaker independent.<CR>
<CR>
Continue? [Y]"

CALL GCHAR(ICHAR,IER)
IF(ICHAR.EQ.78)GO TO 111

CALL NEWSCR

TYPE"<CR>
Press CR, then say the following<CR>
phrases into the microphone after<CR>
prompted by the word, start.<CR>
Only 10 seconds of speech will<CR>
be accepted.<CR>
<CR>
<CR>
CHANGE FREQUENCY TO THREE FIVE SEVEN<CR>
LOCK-ON TARGET AT TWO THOUSAND MARK<CR>
ARM STATION ALFA BRAVO CHARLIE FOXTROT<CR>
MAP AIR-TO-SURFACE MISSLE THREAT"

CALL GCHAR(ICHAR,IER)
CALL SWAP("ATOD10.SV",IER)         ;swap for A/D
IF(IER.NE.1)TYPE"SWAP error ",IER  ;10 sec speech

CALL NEWSCR
CALL TEMPLATE(ENRGY)               ;create phoneme template

CALL NEWSCR
TYPE"Calculating distance matrix"
CALL GCHAR(ICHAR,IER)
CALL REDPHON(PHON)
CALL DISTANCE(PHON)                ;create distance matrix
CALL REDDIS(DIS)

CALL NEWSCR
TYPE"Printing distance matrix"
CALL GCHAR(ICHAR,IER)
CALL PRINTDIS(DIS)                 ;print distance matrix


CALL NEWSCR

DO 15 I=1,70
DO 15 J=1,10
15  VOCAB(J,I)=0                   ;zero VOCAB matrix

NOWORDS=70                         ;no. vocabulary words

CALL REDWRDS(NOWORDS,WORD)         ;read vocabulary
                                   ;from file

TYPE"<CR>
Press CR, a word will appear<CR>
on the screen.  Press CR again<CR>
then say the word.  Repeat. "

CALL GCHAR(ICHAR,IER)

DO 99 II=1,NOWORDS
CALL NEWSCR
WRITE(10,100)WORD(1,II)            ;write word to screen
100  FORMAT(2X,S14)
CALL SWAP("ATOD2.SV",IER)          ;swap for A/D
IF(IER.NE.1)TYPE"SWAP error",IER   ;3 sec speech
CALL REP(ENRGY,PHON,DIS,SOUND)
DO 63 K=1,10
63  VOCAB(K,II)=SOUND(K)           ;store word rep
99  CONTINUE

CALL NEWSCR
TYPE"Printing word phoneme representations"
CALL WRTREP(VOCAB,NOWORDS)
CALL PRINTREP(WORD,VOCAB,NOWORDS)  ;print word rep
CALL REDREP(NOWORDS,VOCAB)
CALL NEWSCR

C***********************************************************

TYPE"<CR>
Training is complete.  Testing<CR>
can now be done.  An asterisk will<CR>
appear on the screen.  Press CR,<CR>
then say the test word.  The machine's<CR>
guess will then be typed on the screen.<CR>
<CR>
Continue? [Y]"

CALL GCHAR(ICHAR,IER)
IF(ICHAR.EQ.78)GO TO 111
CALL NEWSCR

TYPE"*"                            ;request speech input

CALL GCHAR(ICHAR,IER)
CALL SWAP("ATOD5.SV",IER)          ;swap for A/D
IF(IER.NE.1)TYPE"SWAP ERROR",IER   ;3 sec speech

CALL SPEECH(ENRGY,PHON,DIS,VOCAB,WORD)   ;speech
                                   ;recognition

TYPE"<CR>
*  Continue? [Y] "

CALL GCHAR(ICHAR,IER)
IF(ICHAR.EQ.78) GO TO 73
GO TO 77

73  CALL NEWSCR

TYPE"<CR>
*  This concludes the session.<CR>
*  End session? [Y] "

CALL GCHAR(ICHAR,IER)
IF(ICHAR.EQ.78) GO TO 111
GO TO 112

111  TYPE"<CR>
*  Start again? [N] "

CALL GCHAR(ICHAR,IER)
IF(ICHAR.EQ.89) GO TO 1

112  CALL EXIT
END


C***********************************************************
C
C  Title:  VTYPE.FR
C  Author: 1Lt Kathy Dixon
C          Modelled after Hussain's WTYPE
C  Date:   Nov 84
C
C  Function:
C     This routine displays the word specified on the
C     H19 terminal in video on a single line.
C
C***********************************************************

SUBROUTINE VTYPE(L,WORD)

INTEGER L,WORD(7,70)

IF(L.EQ.0) GO TO 10                ;skip zero numbered words

L=(L-1)/10+1
WRITE(12,15)L
WRITE(12,14)WORD(1,L)
WRITE(10,11)WORD(1,L)              ;display word

11  FORMAT(S14)
14  FORMAT( ,S14,Z)
15  FORMAT(2X,I4)

10  RETURN
END

C***********************************************************


C
C  Title:  WRTDIS.FR
C  Author: Kathy Dixon
C  Date:   Nov 84
C
C  Function:  Writes Distance matrix to a disk file.
C
C***********************************************************

SUBROUTINE WRTDIS(DIS)

REAL DIS(2432)

CALL CFILW("DIST",2,IER)
IF(IER.NE.1)TYPE"CFILW ERROR",IER
CALL OPEN(1,"DIST",2,IER)
IF(IER.NE.1)TYPE"OPEN ERROR",IER
WRITE(1,100)(DIS(I),I=1,2432)
100  FORMAT(G11.5)

CALL CLOSE(1,IER)
IF(IER.NE.1)TYPE"CLOSE ERROR",IER

RETURN
END


C***********************************************************
C
C  Title:  WRTPHON.FR
C  Author: Kathy Dixon
C  Date:   Nov 84
C
C  Function:  Writes 70 16-dimensional vectors to
C             a file, PHONE
C
C***********************************************************


SUBROUTINE WRTPHON(PHON)

INTEGER PHON(1130)

CALL CFILW("PHONE",2,IER)
IF(IER.NE.1)TYPE"WRTPHON CFILW ERROR",IER
CALL OPEN(1,"PHONE",2,IER)
IF(IER.NE.1)TYPE"WRTPHON ERROR",IER
WRITE(1,100)(PHON(I),I=1,1130)
100  FORMAT(I6)

CALL CLOSE(1,IER)
IF(IER.NE.1)TYPE"CLOSE ERROR",IER

RETURN
END


C***********************************************************
C
C  Title:  WRTREP.FR
C  Author: Kathy Dixon
C  Date:   Nov 84
C
C  Function:  Writes word phonemes to a file, VOCABUL.
C
C***********************************************************

SUBROUTINE WRTREP(VOCAB,NOWORDS)

INTEGER VOCAB(10,70),NOWORDS

CALL CFILW("VOCABUL",2,IER)
IF(IER.NE.1)TYPE"CFILW ERROR",IER
CALL OPEN(1,"VOCABUL",2,IER)
IF(IER.NE.1)TYPE"OPEN ERROR",IER
WRITE(1,100)((VOCAB(I,J),I=1,10),J=1,NOWORDS)
lOO  FORMAT (II) 

CALL CLOSE(1,IER)
IF(IER.NE.1)TYPE"CLOSE ERROR",IER

RETURN


C***********************************************************
C
C  Title:  VOCAB.FR
C  Author: 1Lt Kathy Dixon
C  Date:   Nov 84
C
C  Function:
C     Creates a file of English words.
C
C  Command Line:
C     RLDR VOCAB NEWSCR
C
C***********************************************************

INTEGER W(7,70),L

ACCEPT"NUMBER OF VOCABULARY WORDS ",L
TYPE"* * * * *"

DO 200 I=1,L
ACCEPT"WORD "
READ(11,100)W(1,I)
100  FORMAT(S14)
200  CONTINUE

ACCEPT"PRINT VOCABULARY [Y]"
CALL GCHAR(ICHAR,IER)
IF(ICHAR.EQ.78)GO TO 111

CALL NEWSCR
WRITE(10,150)(W(1,I),I=1,L)
WRITE(12,150)(W(1,I),I=1,L)
150  FORMAT(2X,5S14)

CALL CFILW("WORDS",2,IER)
IF(IER.NE.1)TYPE"CFILW ERROR",IER
CALL OPEN(1,"WORDS",2,IER)
IF(IER.NE.1)TYPE"OPEN ERROR",IER
WRITE(1,100)(W(1,I),I=1,70)

CALL CLOSE(1,IER)
IF(IER.NE.1)TYPE"CLOSE ERROR",IER

111  CALL EXIT
END


VITA 

Kathy Renee Dixon was born on 27 January 1959 in
Roswell, New Mexico.  She graduated from high school in
Orlando, Florida in 1977 and attended the University of
Central Florida, Orlando, Florida, from which she received
the degree of Bachelor of Science in Engineering in April
1982.  She entered the Air Force on active duty in May 1982
and received her commission from Officer Training School in
August 1982.  She served as a Project Engineer in the Signal
Processing Laboratory, Air Force Institute of Technology,
until entering the School of Engineering in June 1983.

Permanent address:  520 Sheppard Road
                    Orlando, Florida 32820


REPORT DOCUMENTATION PAGE

Distribution/availability of report:  Approved for public release,
    distribution unlimited
Performing organization report number:  ...ENG/84D-26
Performing organization:  School of Engineering (AFIT/ENG),
    Air Force Institute of Technology,
    Wright-Patterson AFB, Ohio 45433
Title:  IMPLEMENTATION OF A REAL-TIME, INTERACTIVE, CONTINUOUS
    SPEECH RECOGNITION SYSTEM
Page count:  135
Subject terms:  Recognition, Speech, Phonemes, Speech Recognition,
    Continuous Speech Recognition
Thesis Chairman:  Dr. Matthew Kabrisky, Professor of Electrical
    Engineering, Air Force Institute of Technology
Abstract security classification:  UNCLASSIFIED


